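WO2021232913A1 - Voice information processing method and apparatus, storage medium, and electronic device - Google Patents
Voice information processing method and apparatus, storage medium, and electronic device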
- Publication number
- WO2021232913A1 (PCT/CN2021/081332)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- voice
- voice information
- information
- assistant
- trigger event
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/44—Arrangements for executing specific programs
- G06F9/4401—Bootstrapping
- G06F9/4418—Suspend and resume; Hibernate and awake
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/16—Sound input; Sound output
- G06F3/167—Audio in a user interface, e.g. using voice commands for navigating, audio feedback
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/048—Interaction techniques based on graphical user interfaces [GUI]
- G06F3/0481—Interaction techniques based on graphical user interfaces [GUI] based on specific properties of the displayed interaction object or a metaphor-based environment, e.g. interaction with desktop elements like windows or icons, or assisted by a cursor's changing behaviour or appearance
- G06F3/04815—Interaction with a metaphor-based environment or interaction object displayed as three-dimensional [3D], e.g. changing the user viewpoint with respect to the environment or object
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/44—Arrangements for executing specific programs
- G06F9/445—Program loading or initiating
- G06F9/44505—Configuring for program initiating, e.g. using registry, configuration files
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/44—Arrangements for executing specific programs
- G06F9/451—Execution arrangements for user interfaces
- G06F9/453—Help systems
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/54—Interprogram communication
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/223—Execution procedure of a spoken command
Definitions
- the present disclosure relates to the technical field of voice processing, and in particular, to a voice information processing method, a voice information processing device, a computer-readable storage medium, and electronic equipment.
- smart glasses combine computer technology with traditional glasses to achieve rich functions. These functions include human-computer interaction through voice, and the introduction of voice processing has greatly improved the convenience of using smart glasses.
- the voice processing process relies on the processor equipped with smart glasses.
- the smart glasses may encounter problems of poor recognition and poor interaction effects when performing voice processing. If smart glasses are equipped with high-performance processors, the cost of the smart glasses increases, which is unacceptable for ordinary users.
- a voice information processing method which is applied to a first device, and the voice information processing method includes: acquiring first voice information, and if the first voice information contains a wake-up keyword, sending a voice assistant wake-up instruction to a second device so that the second device starts the voice assistant; acquiring second voice information, and sending the second voice information to the second device so that the second device uses the voice assistant to determine a voice trigger event corresponding to the second voice information; and receiving target information fed back by the second device, and executing the voice trigger event based on the target information.
- a voice information processing method applied to a second device includes: starting the voice assistant in response to a voice assistant wake-up instruction, wherein the voice assistant wake-up instruction is sent to the second device by the first device when the first device determines that the first voice information contains the wake-up keyword; obtaining the second voice information sent by the first device, and using the voice assistant to determine the voice trigger event corresponding to the second voice information; and feeding back the target information associated with executing the voice trigger event to the first device, so that the first device executes the voice trigger event based on the target information.
- a voice information processing device applied to a first device, including: a wake-up trigger module configured to obtain first voice information and, if the first voice information contains a wake-up keyword, send a voice assistant wake-up instruction to the second device so that the second device starts the voice assistant; a voice sending module configured to obtain the second voice information and send the second voice information to the second device, so that the second device uses the voice assistant to determine a voice trigger event corresponding to the second voice information; and an event execution module configured to receive the target information fed back by the second device and execute the voice trigger event based on the target information.
- a voice information processing device applied to a second device, including: a voice assistant activation module configured to start the voice assistant in response to a voice assistant wake-up instruction, wherein the voice assistant wake-up instruction is sent to the second device by the first device when the first device determines that the first voice information contains the wake-up keyword; an event determination module configured to obtain the second voice information sent by the first device and use the voice assistant to determine the voice trigger event corresponding to the second voice information; and an information feedback module configured to feed back target information associated with executing the voice trigger event to the first device, so that the first device executes the voice trigger event based on the target information.
- a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, the above-mentioned voice information processing method is realized.
- an electronic device including a processor and a memory configured to store one or more programs, wherein when the one or more programs are executed by the processor, the processor implements the foregoing voice information processing method.
- FIG. 1 shows a schematic diagram of an exemplary system architecture to which a voice information processing solution according to an embodiment of the present disclosure is applied;
- FIG. 2 shows a schematic structural diagram of an electronic device suitable for implementing embodiments of the present disclosure;
- FIG. 3 schematically shows a flowchart of a voice information processing method applied to a first device according to an exemplary embodiment of the present disclosure;
- FIG. 4 schematically shows a flowchart of a voice information processing method applied to a second device according to an exemplary embodiment of the present disclosure;
- FIG. 5 schematically shows a device interaction diagram for implementing a voice information processing process according to an exemplary embodiment of the present disclosure;
- FIG. 6 schematically shows an overall software and hardware architecture diagram of a voice information processing solution of an embodiment of the present disclosure;
- FIG. 7 schematically shows a block diagram of a voice information processing apparatus applied to a first device according to an exemplary embodiment of the present disclosure;
- FIG. 8 schematically shows a block diagram of a voice information processing apparatus applied to a first device according to another exemplary embodiment of the present disclosure;
- FIG. 9 schematically shows a block diagram of a voice information processing apparatus applied to a first device according to another exemplary embodiment of the present disclosure;
- FIG. 10 schematically shows a block diagram of a voice information processing apparatus applied to a second device according to an exemplary embodiment of the present disclosure;
- FIG. 11 schematically shows a block diagram of a voice information processing apparatus applied to a second device according to another exemplary embodiment of the present disclosure.
- FIG. 1 shows a schematic diagram of an exemplary system architecture to which a voice information processing solution according to an embodiment of the present disclosure is applied.
- the system architecture may include a first device 1100 and a second device 1200.
- the first device 1100 may be a device that receives voice information and executes events corresponding to the voice information.
- the first device 1100 may be AR (Augmented Reality) glasses, VR (Virtual Reality) glasses, or MR (Mixed Reality) glasses; however, the first device 1100 may also be another wearable device with a display function, such as a smart helmet.
- the second device 1200 may be a device configured with a voice assistant.
- the voice assistant analyzes the voice information received from the first device 1100, determines the event corresponding to the voice information, and feeds the relevant information back to the first device 1100, so that the first device 1100 executes the event corresponding to the voice information.
- the second device 1200 may be a mobile phone, a tablet, a personal computer, or the like.
- the first device 1100 obtains the first voice information, performs keyword recognition on the first voice information, and determines whether the first voice information includes a wake-up keyword. In the case that the first voice message contains the wake-up keyword, the first device 1100 sends a voice assistant wake-up instruction to the second device 1200. The second device 1200 responds to the voice assistant wake-up instruction to start the voice assistant installed on the second device 1200.
- the first device 1100 can send the second voice information to the second device 1200, and the second device 1200 can use the voice assistant to determine the voice trigger event corresponding to the second voice information and send the target information associated with the voice trigger event to the first device 1100. The first device 1100 may then execute the voice trigger event based on the target information.
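The exchange described above can be sketched in a few lines of Python. This is only an illustrative model, not the disclosed implementation: the class names, the method names, the wake-up keyword string, and the dictionary used as target information are all hypothetical, and plain strings stand in for audio.

```python
from dataclasses import dataclass, field


@dataclass
class SecondDevice:
    """Stand-in for the phone or tablet that hosts the voice assistant."""
    assistant_running: bool = False

    def on_wake_instruction(self) -> None:
        # Start the voice assistant in response to the wake-up instruction.
        self.assistant_running = True

    def on_voice(self, second_voice_info: str) -> dict:
        # Determine the voice trigger event and return target information.
        if not self.assistant_running:
            raise RuntimeError("voice assistant has not been started")
        return {"event": second_voice_info.strip().lower()}


@dataclass
class FirstDevice:
    """Stand-in for the AR glasses that capture voice and execute events."""
    peer: SecondDevice
    wake_keyword: str = "hello assistant"  # hypothetical keyword
    executed: list = field(default_factory=list)

    def on_first_voice(self, first_voice_info: str) -> bool:
        # Keyword recognition runs locally; only a wake-up instruction is sent.
        if self.wake_keyword in first_voice_info.lower():
            self.peer.on_wake_instruction()
            return True
        return False

    def on_second_voice(self, second_voice_info: str) -> None:
        # Forward the voice, then execute the event named in the target info.
        target_info = self.peer.on_voice(second_voice_info)
        self.executed.append(target_info["event"])
```

In this model the first device never interprets the second voice information itself; it only forwards it and acts on what comes back, which mirrors the division of work between the two devices.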
- the process of determining the voice trigger event may be implemented by the second device 1200 alone, that is, the second device 1200 uses the voice assistant to analyze the second voice information (including but not limited to voice recognition, semantic recognition, speech synthesis, etc.) and determines the voice trigger event based on the analysis result.
- the architecture that implements the voice information processing process of the present disclosure may further include a server 1300.
- the second device 1200 may send the second voice information to the server 1300, and the server 1300 may analyze the second voice information and feed the analysis result back to the second device 1200.
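The choice between analysis on the second device alone and analysis by the server can be pictured as follows. This is a minimal sketch under stated assumptions: the function name and the `Analyzer` type are hypothetical, and the fallback to local analysis when the server is unreachable is an assumption added for illustration, not something the disclosure states.

```python
from typing import Callable, Optional

# An analyzer maps the second voice information to an analysis result.
Analyzer = Callable[[str], str]


def analyze_second_voice(
    second_voice_info: str,
    local_analyzer: Analyzer,
    server_analyzer: Optional[Analyzer] = None,
) -> str:
    """Return the analysis result used to determine the voice trigger event."""
    if server_analyzer is not None:
        try:
            # Server path: the server analyzes and feeds the result back.
            return server_analyzer(second_voice_info)
        except ConnectionError:
            # Assumption: fall back to on-device analysis if the server fails.
            pass
    # Local path: the second device analyzes the voice information itself.
    return local_analyzer(second_voice_info)
```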
- the voice trigger events may include adjusting the volume of the AR glasses, checking the weather, making calls, setting schedules, screen recording, taking screenshots, opening/closing the album, opening/closing a designated application, shutting down, etc.
- the second device 1200 may send the voice assistant and/or the user interface (UI) of the voice trigger event to the first device 1100 for display.
- the user interface that the second device 1200 sends to the first device 1100 for display by the first device 1100 may be different from the user interface displayed by the second device 1200 itself.
- the user interface sent by the second device 1200 to the first device 1100 may be a three-dimensional interface image rendered by the second device 1200 or the server 1300, whereas the user interface displayed by the second device 1200 itself is usually a two-dimensional interface, and the two also differ in interface layout and content.
- when the first device 1100 is, for example, AR glasses, a three-dimensional effect can be presented for the user to view.
- Fig. 2 shows a schematic diagram of an electronic device suitable for implementing exemplary embodiments of the present disclosure.
- the first device and/or the second device in the exemplary embodiment of the present disclosure may be configured as shown in FIG. 2. It should be noted that the electronic device shown in FIG. 2 is only an example, and should not bring any limitation to the function and scope of use of the embodiments of the present disclosure.
- the electronic device of the present disclosure includes at least a processor and a memory.
- the memory is configured to store one or more programs.
- when the one or more programs are executed by the processor, the processor can implement the voice information processing method of the exemplary embodiment of the present disclosure.
- the electronic device 200 may include: a processor 210, an internal memory 221, an external memory interface 222, a universal serial bus (USB) interface 230, a charging management module 240, a power management module 241, a battery 242, an antenna 1, an antenna 2, a mobile communication module 250, a wireless communication module 260, an audio module 270, a speaker 271, a receiver 272, a microphone 273, an earphone interface 274, a sensor module 280, a display screen 290, a camera module 291, an indicator 292, a motor 293, a button 294, a Subscriber Identification Module (SIM) card interface 295, etc.
- the sensor module 280 may include a depth sensor, a pressure sensor, a gyroscope sensor, an air pressure sensor, a magnetic sensor, an acceleration sensor, a distance sensor, a proximity light sensor, a fingerprint sensor, a temperature sensor, a touch sensor, an ambient light sensor, and a bone conduction sensor.
- the structure illustrated in the embodiment of the present application does not constitute a specific limitation on the electronic device 200.
- the electronic device 200 may include more or fewer components than shown, or combine certain components, or split certain components, or arrange different components.
- the illustrated components can be implemented in hardware, software, or a combination of software and hardware.
- the processor 210 may include one or more processing units.
- the processor 210 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc.
- the different processing units may be independent devices or integrated in one or more processors.
- a memory may be provided in the processor 210 to store instructions and data.
- the USB interface 230 is an interface that complies with the USB standard specification, and specifically may be a Mini USB interface, a Micro USB interface, a USB Type-C interface, and so on.
- the USB interface 230 can be used to connect a charger to charge the electronic device 200, and can also be used to transfer data between the electronic device 200 and peripheral devices. It can also be used to connect earphones and play audio through earphones. This interface can also be used to connect other electronic devices, such as AR devices.
- the charging management module 240 is used to receive charging input from the charger.
- the charger can be a wireless charger or a wired charger.
- the power management module 241 is used to connect the battery 242, the charging management module 240, and the processor 210.
- the power management module 241 receives input from the battery 242 and/or the charging management module 240, and supplies power to the processor 210, the internal memory 221, the display screen 290, the camera module 291, and the wireless communication module 260.
- the wireless communication function of the electronic device 200 can be implemented by the antenna 1, the antenna 2, the mobile communication module 250, the wireless communication module 260, the modem processor, and the baseband processor.
- the mobile communication module 250 can provide a wireless communication solution including 2G/3G/4G/5G and the like applied to the electronic device 200.
- the wireless communication module 260 can provide wireless communication solutions applied to the electronic device 200, including wireless local area networks (WLAN) such as Wireless Fidelity (Wi-Fi) networks, Bluetooth (BT), Global Navigation Satellite System (GNSS), frequency modulation (FM), near field communication (NFC), and infrared (IR) technology.
- the electronic device 200 implements a display function through a GPU, a display screen 290, an application processor, and the like.
- the GPU is a microprocessor for image processing and is connected to the display screen 290 and the application processor.
- the GPU is used to perform mathematical and geometric calculations and is used for graphics rendering.
- the processor 210 may include one or more GPUs that execute program instructions to generate or change display information.
- the electronic device 200 can implement a shooting function through an ISP, a camera module 291, a video codec, a GPU, a display screen 290, and an application processor.
- the electronic device 200 may include 1 or N camera modules 291, and N is a positive integer greater than 1. If the electronic device 200 includes N cameras, one of the N cameras is the main camera.
- the internal memory 221 may be used to store computer executable program code, where the executable program code includes instructions.
- the internal memory 221 may include a storage program area and a storage data area.
- the external memory interface 222 may be used to connect an external memory card, such as a Micro SD card, so as to expand the storage capacity of the electronic device 200.
- the electronic device 200 can implement audio functions through an audio module 270, a speaker 271, a receiver 272, a microphone 273, a headphone interface 274, an application processor, and the like. For example, music playback, recording, etc.
- the audio module 270 is used to convert digital audio information into an analog audio signal for output, and is also used to convert an analog audio input into a digital audio signal.
- the audio module 270 can also be used to encode and decode audio signals.
- the audio module 270 may be provided in the processor 210, or part of the functional modules of the audio module 270 may be provided in the processor 210.
- the speaker 271, also called a "loudspeaker", is used to convert audio electrical signals into sound signals.
- the user can listen to music or to a hands-free call through the speaker 271 of the electronic device 200.
- the microphone 273, also called a "mic", is used to convert sound signals into electrical signals. When making a call or sending a voice message, the user can speak with the mouth close to the microphone 273 to input the sound signal to the microphone 273.
- the electronic device 200 may be provided with at least one microphone 273.
- the earphone interface 274 is used to connect wired earphones.
- the depth sensor is used to obtain depth information of the scene.
- the pressure sensor is used to sense the pressure signal and can convert the pressure signal into an electrical signal.
- the gyro sensor can be used to determine the movement posture of the electronic device 200.
- the air pressure sensor is used to measure air pressure.
- the magnetic sensor includes a Hall sensor.
- the electronic device 200 may use a magnetic sensor to detect the opening and closing of the flip cover.
- the acceleration sensor can detect the magnitude of the acceleration of the electronic device 200 in various directions (generally three axes).
- the distance sensor is used to measure distance.
- the proximity light sensor may include, for example, a light emitting diode (LED) and a light detector, such as a photodiode.
- the fingerprint sensor is used to collect fingerprints.
- the temperature sensor is used to detect temperature.
- the touch sensor can pass the detected touch operation to the application processor to determine the type of touch event.
- the visual output related to the touch operation can be provided through the display screen 290.
- the ambient light sensor is used to sense the brightness of the ambient light. Bone conduction sensors can acquire vibration signals.
- the button 294 includes a power-on button, a volume button, and so on.
- the button 294 may be a mechanical button. It can also be a touch button.
- the motor 293 can generate vibration prompts. The motor 293 can be used for incoming call vibration notification, and can also be used for touch vibration feedback.
- the indicator 292 can be an indicator light, which can be used to indicate the charging status, power change, or to indicate messages, missed calls, notifications, and so on.
- the SIM card interface 295 is used to connect to the SIM card.
- the electronic device 200 interacts with the network through the SIM card to implement functions such as call and data communication.
- the present application also provides a computer-readable storage medium.
- the computer-readable storage medium may be included in the electronic device described in the above embodiment; or it may exist alone without being assembled into the electronic device.
- the computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the above. More specific examples of computer-readable storage media may include, but are not limited to: electrical connections with one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the above.
- a computer-readable storage medium may be any tangible medium that contains or stores a program, and the program may be used by or in combination with an instruction execution system, apparatus, or device.
- the computer-readable storage medium can send, propagate, or transmit the program for use by or in combination with the instruction execution system, apparatus, or device.
- the program code contained on the computer-readable storage medium can be transmitted by any suitable medium, including but not limited to: wireless, wire, optical cable, RF, etc., or any suitable combination of the above.
- the computer-readable storage medium carries one or more programs, and when the above one or more programs are executed by an electronic device, the electronic device realizes the method described in the following embodiments.
- each block in the flowchart or block diagram may represent a module, program segment, or part of the code, and the above-mentioned module, program segment, or part of the code contains one or more executable instructions for realizing the specified logic function.
- the functions marked in the blocks may also occur in a different order from the order marked in the drawings. For example, two blocks shown one after another can actually be executed substantially in parallel, and they can sometimes be executed in the reverse order, depending on the functions involved.
- each block in the block diagram or flowchart, and the combination of blocks in the block diagram or flowchart, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or can be implemented by a combination of dedicated hardware and computer instructions.
- the units described in the embodiments of the present disclosure may be implemented in software or hardware, and the described units may also be provided in a processor. Among them, the names of these units do not constitute a limitation on the unit itself under certain circumstances.
- Fig. 3 schematically shows a flowchart of a voice information processing method applied to a first device according to an exemplary embodiment of the present disclosure.
- the voice information processing method applied to the first device may include the following steps:
- the first device is equipped with a voice recording device.
- the first device may be equipped with a single microphone or a microphone array to obtain voice information around the first device.
- the first device is also equipped with a DSP (Digital Signal Processor) chip, which is configured to analyze the first voice information once it is acquired to determine whether the first voice information contains a wake-up keyword.
- the wake-up keyword is used to wake up the voice assistant on the second device.
- the first device is usually equipped with a sound card.
- both the voice recording device and the DSP chip of the first device can be connected to the sound card, and the first voice information acquired by the voice recording device is sent to the DSP chip through the sound card.
- the sound card needs to be activated to enable the DSP chip to obtain the first voice information.
- the voice recording device equipped with the first device can be directly connected to the DSP chip. Therefore, the process of acquiring the first voice information by the DSP chip can be realized without relying on the activation of the sound card.
- the voice recording device of the first device can obtain the first voice information, and the DSP chip can analyze the first voice information to determine whether the first voice information contains a wake-up keyword.
- in the case that the first voice information contains a wake-up keyword, the first device may send a voice assistant wake-up instruction to the second device, and the second device may start the voice assistant in response to the voice assistant wake-up instruction.
- the first device and the second device are connected via USB. This wired connection can avoid the problem of frame loss in information transmission and improve the accuracy of voice processing.
- the first device may also be connected to the second device through Bluetooth or Wi-Fi, which is not limited in the present disclosure.
- the voice assistant of the second device is in a closed state, and the entire system is in a semi-sleep state, which helps to reduce the power consumption of the system.
- the second device can send the three-dimensional interface image of the voice assistant to the first device, so that the three-dimensional interface image of the voice assistant can be displayed on the display end of the first device.
- the user can see the three-dimensional interface image of the voice assistant on the lens of the AR glasses, which reminds the user that the voice assistant has been turned on and a voice command can be entered.
- the three-dimensional interface image of the voice assistant may be rendered and generated by the second device. Specifically, it may be an image rendered after responding to the wake-up instruction of the voice assistant. It may also be an image that is rendered and stored in advance and then retrieved in response to the wake-up instruction of the voice assistant.
- the present disclosure does not limit the rendering method, time, etc.
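The two rendering timings mentioned above (render after the wake-up instruction, or render in advance and retrieve on wake-up) can be contrasted with a small sketch. The class name `AssistantUiProvider`, the `prerender` flag, and the `render` callable are hypothetical names used only for illustration.

```python
from typing import Callable, Optional


class AssistantUiProvider:
    """Supplies the assistant's interface image when the assistant wakes."""

    def __init__(self, render: Callable[[], str], prerender: bool = False):
        self._render = render
        # With prerender=True the image is generated and stored in advance.
        self._cached: Optional[str] = render() if prerender else None

    def on_wake(self) -> str:
        # Retrieve the stored image if there is one; otherwise render now.
        if self._cached is not None:
            return self._cached
        return self._render()
```

Pre-rendering moves the rendering cost ahead of the wake-up, while on-demand rendering avoids holding an image that may never be shown; the disclosure leaves this choice open.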
- the microphone of the first device can acquire the second voice information and send the second voice information to the second device. It should be understood that, when the voice assistant of the second device is turned on, the first device can control the DSP chip for wake-up keyword recognition to be in a dormant state.
- the second device can obtain the second voice information by way of USB redirection. That is, the voice recording device of the first device is redirected to the voice input of the second device, and the voice assistant of the second device can monitor the voice recorded by the first device.
- the USB insertion event of the first device can be monitored by a service in the second device that matches the first device. It should be understood that this service responds only to the USB insertion event of the first device and not to USB insertion events of other devices; next, the second device can reset its own voice recording device to the voice recording device of the first device through a callback function.
- the AR service of the mobile phone can monitor the USB insertion event of the AR glasses and provide security verification; next, the mobile phone can reset its voice recording device to the microphone of the AR glasses; then, the user records voice information directly through the microphone of the AR glasses, and the voice information is transmitted directly to the voice recording device of the mobile phone and passed on to the upper-layer application (that is, the voice assistant).
- the second device may use the voice assistant to analyze the second voice information (including but not limited to operations such as voice recognition, semantic recognition, and voice synthesis) to determine the voice trigger event corresponding to the second voice information.
- for example, the analyzed voice trigger event is "turn down the volume"; as another example, the analyzed voice trigger event is "shut down"; as yet another example, the analyzed voice trigger event is "open album", and so on.
- these embodiments can be implemented when the second device is not connected to the Internet (that is, not connected to the server).
- the second device may use the voice assistant to send the second voice information to the server; the server analyzes the second voice information, determines the voice trigger event, and feeds the voice trigger event back to the second device.
- the server can realize dynamic expansion services according to the actual needs of the business.
- the server may determine the voice trigger event based on the first voice processing procedure.
- the first voice processing process may include: first, the server performs voice recognition on the second voice information to convert the second voice information into text information; next, the server may perform semantic recognition on the text information to determine the semantics corresponding to the text information and generate a semantic recognition result, where the semantic recognition result can be presented in the form of text; then, the server determines the voice trigger event corresponding to the second voice information according to the semantic recognition result.
- the voice trigger event may be, for example, an event for controlling the first device. For example, controlling the first device to shut down.
- the server may determine the voice trigger event based on the second voice processing procedure.
- the second voice processing process may include: first, the server performs voice recognition on the second voice information to convert the second voice information into text information; next, the server may perform semantic recognition on the text information to obtain a voice recognition result; then, the server may determine the response text corresponding to the second voice information according to the voice recognition result, and convert the response text into third voice information as the information contained in the voice trigger event.
- in this case, the second voice information may be information inquiring about the weather, the corresponding voice trigger event may be a voice broadcast of the weather conditions, the above response text refers to real-time weather information determined in text form, and the third voice information is the voice information generated by text-to-speech conversion of the response text, such as "sunny" or "light rain", which can correspond to the target information fed back to the second device as described below.
- on the one hand, the above voice analysis process is only an example, and the present disclosure does not limit the specific process; on the other hand, the above voice analysis process can also be implemented on the second device, in which case the solution may not require the participation of the server.
- S36. Receive target information fed back by the second device, and execute a voice trigger event based on the target information.
- after the second device determines the voice trigger event corresponding to the second voice information, the second device can determine the target information associated with the execution of the voice trigger event and send the target information to the first device, so that the first device executes the voice trigger event based on the target information.
- the target information may be an instruction to control the state of the first device, or may be information related to displaying content corresponding to the voice trigger event.
- the target information may be an instruction to lower the volume of the first device, and the first device may lower its own volume after receiving the instruction.
- the target information can include the information of each image in the album, so that after the first device receives the target information, it can display the images contained in the album on the display end of the first device.
- photos contained in an album can be displayed on the lenses of AR glasses.
- the target information may be an instruction to control the shutdown of the first device, and the first device may shut down after receiving the instruction.
- the voice assistant of the second device can integrate the mapping relationship between the voice trigger event and the local system command, so that when the voice trigger event is determined, these system commands can be used to control the first device.
- an exemplary embodiment of the present disclosure further includes a solution of sending a three-dimensional interface image corresponding to the voice trigger event to the first device.
- the second device may determine the three-dimensional interface image corresponding to the voice trigger event, and send the three-dimensional interface image to the first device, so that the first device can display the three-dimensional interface image on the display end.
- for example, the user can see the three-dimensional interface image corresponding to "turn down the volume" on the lens of the AR glasses.
- similar to the three-dimensional interface image of the voice assistant, the three-dimensional interface image corresponding to the voice trigger event is rendered and generated by the second device, and the present disclosure does not limit the rendering method, time, etc. It is understandable that the three-dimensional interface image described in the present disclosure may include one or a combination of text, symbols, static pictures, dynamic pictures, and videos.
- after the first device sends a voice assistant wake-up instruction to the second device, the first device starts timing. If the second voice information has not been obtained after a predetermined period of time (for example, 20 seconds), the first device sends a voice assistant shutdown instruction to the second device, and the second device may turn off the voice assistant in response to the voice assistant shutdown instruction.
- the second device may also do the timing itself: if the second voice information has not been obtained after a predetermined period of time, the second device turns off the voice assistant by itself.
- Exemplary embodiments of the present disclosure also provide a flowchart of a voice information processing method applied to a second device.
- the voice information processing method applied to the second device may include the following steps:
- the specific processes of steps S42 to S46 have been described in detail in steps S32 to S36 above and will not be repeated here.
- with reference to FIG. 5, a device interaction diagram for implementing a voice processing process according to an embodiment of the present disclosure will be described.
- in step S502, the first device obtains the first voice information and determines, through the DSP chip it is equipped with, whether the first voice information contains a wake-up keyword, where the wake-up keyword can be a user-defined keyword, for example, "Xiaobu" (小布).
- in step S504, if the first voice information contains the wake-up keyword, the first device may send a voice assistant wake-up instruction to the second device.
- in step S506, the second device starts the voice assistant.
- in step S508, the second device may send the three-dimensional interface image of the voice assistant to the first device.
- in step S510, the first device may display the three-dimensional interface image of the voice assistant on its display end.
- the three-dimensional interface image of the voice assistant can be displayed on the lens of AR glasses.
- Steps S502 to S510 exemplarily describe the process of starting the voice assistant of the second device through the wake-up service of the first device.
- the process of controlling the first device by voice will be described below with reference to steps S512 to S528.
- in step S512, the first device obtains the second voice information and sends the second voice information to the second device.
- in step S514, the second device sends the second voice information to the server.
- in step S516, the server analyzes the second voice information to obtain an analysis result.
- in step S518, the server feeds back the analysis result of the second voice information to the second device.
- in step S520, the second device determines the voice trigger event according to the analysis result of the server.
- in step S522, the second device sends target information to the first device. The target information is associated with the execution of the voice trigger event, that is, it is the information the first device needs in order to execute the voice trigger event.
- in step S524, the first device executes the voice trigger event.
- in step S526, the second device sends a three-dimensional interface image corresponding to the voice trigger event to the first device.
- in step S528, the first device displays the three-dimensional interface image corresponding to the voice trigger event.
- the first device is AR glasses
- the second device is a mobile phone.
- the AR glasses and the mobile phone are connected via USB
- the mobile phone and the server can be connected via 3G, 4G, 5G, WiFi and other methods.
- for the AR glasses, the operating system may be an RTOS (Real Time Operating System), and the AR glasses themselves are equipped with a DSP chip to provide the wake-up service.
- when the microphone of the AR glasses obtains the first voice information, the keyword recognition engine is used to determine whether the first voice information includes a wake-up keyword.
- when the voice assistant of the mobile phone has been woken up, the AR glasses can present the three-dimensional user interface of the voice assistant rendered by the mobile phone.
- for the mobile phone, the operating system may be configured with an AR software platform (for example, ARCore or ARKit).
- on top of the AR software platform, a voice assistant application (APP) can be configured, and the voice assistant APP can be started in response to instructions generated by the wake-up service of the AR glasses.
- through UI interaction with the voice assistant user interface, the microphone of the AR glasses can be reset; in addition, this user interface can correspond to the voice assistant user interface displayed on the AR glasses.
- the voice software development kit provided by the voice assistant APP in the mobile phone can be used to interact with the voice semantic analysis engine of the server, so as to send the second voice information to the server; the voice semantic analysis engine of the server analyzes the second voice information and feeds back the analysis result.
- FIG. 6 is only an example and should not be regarded as a limitation of the present disclosure.
- this exemplary embodiment also provides a voice information processing apparatus applied to the first device.
- FIG. 7 schematically shows a block diagram of a voice information processing apparatus applied to a first device according to an exemplary embodiment of the present disclosure.
- the voice information processing apparatus 7 applied to the first device according to an exemplary embodiment of the present disclosure may include a wake-up trigger module 71, a voice sending module 73 and an event execution module 75.
- the wake-up trigger module 71 may be configured to obtain the first voice information and, if the first voice information contains a wake-up keyword, send a voice assistant wake-up instruction to the second device so that the second device starts the voice assistant; the voice sending module 73 may be configured to acquire the second voice information and send the second voice information to the second device, so that the second device uses the voice assistant to determine the voice trigger event corresponding to the second voice information; the event execution module 75 may be configured to receive the target information that is fed back by the second device and associated with the execution of the voice trigger event, and to execute the voice trigger event based on the target information.
- the voice information processing device 8 may further include an image display module 81.
- the image display module 81 may be configured to execute: after the second device starts the voice assistant, receiving a three-dimensional interface image of the voice assistant, wherein the three-dimensional interface image of the voice assistant is rendered and generated by the second device; and displaying the three-dimensional interface image of the voice assistant on the display end of the first device.
- the first device and the second device are connected via USB.
- in this case, the voice sending module 73 may be configured to perform: obtaining the second voice information; and sending the second voice information to the second device by way of USB redirection.
- the image display module 81 may be further configured to perform: receiving a three-dimensional interface image corresponding to a voice trigger event, wherein the three-dimensional interface image corresponding to the voice trigger event is rendered and generated by the second device; and displaying the three-dimensional interface image corresponding to the voice trigger event on the display end of the first device.
- the voice information processing device 9 may further include a shutdown trigger module 91.
- the shutdown trigger module 91 may be configured to execute: starting timing after the voice assistant wake-up instruction is sent to the second device; and, if the second voice information has not been obtained after a predetermined period of time, sending a voice assistant shutdown instruction to the second device so that the second device closes the voice assistant.
- this exemplary embodiment also provides a voice information processing apparatus applied to a second device.
- FIG. 10 schematically shows a block diagram of a voice information processing apparatus applied to a second device according to an exemplary embodiment of the present disclosure.
- the voice information processing apparatus 10 applied to the second device according to an exemplary embodiment of the present disclosure may include a voice assistant activation module 101, an event determination module 103, and an information feedback module 105.
- the voice assistant activation module 101 may be configured to respond to a voice assistant wake-up instruction and start the voice assistant; wherein, the voice assistant wake-up instruction is sent by the first device to the second device when it is determined that the first voice message contains a wake-up keyword.
- the event determination module 103 may be configured to obtain the second voice information sent by the first device, and use the voice assistant to determine the voice trigger event corresponding to the second voice information; the information feedback module 105 may be configured to feed back the target information associated with the execution of the voice trigger event to the first device, so that the first device executes the voice trigger event based on the target information.
- the information feedback module 105 may also be configured to execute: after the voice assistant is started, sending a three-dimensional interface image of the voice assistant to the first device, so that the first device displays the three-dimensional interface image of the voice assistant on its display end; wherein the three-dimensional interface image of the voice assistant is rendered and generated by the second device.
- the first device and the second device are connected via USB.
- the process of obtaining the second voice information sent by the first device by the event determination module 103 may be configured to execute:
- the second voice information is obtained from the first device in a redirection manner.
- the event determination module 103 may also be configured to execute: using the voice assistant to send the second voice information to the server, so that the server determines the voice trigger event corresponding to the second voice information; and acquiring the voice trigger event, determined by the server, that corresponds to the second voice information.
- the information feedback module 105 may also be configured to execute: after the voice trigger event corresponding to the second voice information is determined using the voice assistant, sending a three-dimensional interface image corresponding to the voice trigger event to the first device, so that the first device displays the three-dimensional interface image corresponding to the voice trigger event on its display end; wherein the three-dimensional interface image corresponding to the voice trigger event is rendered and generated by the second device.
- the voice information processing device 11 may further include a voice assistant closing module 111.
- the voice assistant closing module 111 may be configured to execute: closing the voice assistant in response to a voice assistant closing instruction.
- the example embodiments described here can be implemented by software, or by software combined with the necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and which includes several instructions that cause a computing device (which may be a personal computer, a server, a terminal device, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.
- although several modules or units of the device for action execution are mentioned in the above detailed description, this division is not mandatory.
- the features and functions of two or more modules or units described above may be embodied in one module or unit.
- the features and functions of a module or unit described above can be further divided into multiple modules or units to be embodied.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Software Systems (AREA)
- General Physics & Mathematics (AREA)
- Human Computer Interaction (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Health & Medical Sciences (AREA)
- Multimedia (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Acoustics & Sound (AREA)
- Artificial Intelligence (AREA)
- Computer Security & Cryptography (AREA)
- User Interface Of Digital Computer (AREA)
Abstract
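A voice information processing method, a voice information processing apparatus, a computer-readable storage medium, and an electronic device, relating to the technical field of voice processing. The voice information processing method includes: a first device (1100) acquires first voice information and, when the first voice information contains a wake-up keyword, sends a voice assistant wake-up instruction to a second device (1200) so that the second device (1200) starts a voice assistant. The first device (1100) then acquires second voice information and sends it to the second device (1200); the second device (1200) uses the voice assistant to determine a voice trigger event corresponding to the second voice information and feeds back target information associated with executing the voice trigger event to the first device (1100), so that the first device (1100) executes the voice trigger event based on the target information. The method can reduce the computational load on the first device (1100).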
Description
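Cross-Reference to Related Applications
This application claims priority to Chinese patent application No. 202010419583.7, filed on May 18, 2020 and entitled "Voice information processing method and apparatus, storage medium, and electronic device", the entire contents of which are incorporated herein by reference.
The present disclosure relates to the technical field of voice processing, and in particular to a voice information processing method, a voice information processing apparatus, a computer-readable storage medium, and an electronic device.
Smart glasses, as a kind of wearable device, combine computer technology with traditional glasses to provide rich functionality. One such function is human-computer interaction by voice, and the introduction of voice processing has greatly improved the convenience of smart glasses.
At present, voice processing relies on the processor built into the smart glasses. However, limited by the computing power of that processor, smart glasses often suffer from poor recognition and poor interaction when processing voice. Equipping smart glasses with a high-performance processor would raise their cost to a level that ordinary users cannot accept.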
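Summary of the Invention
According to a first aspect of the present disclosure, a voice information processing method applied to a first device is provided. The method includes: acquiring first voice information and, if the first voice information contains a wake-up keyword, sending a voice assistant wake-up instruction to a second device so that the second device starts a voice assistant; acquiring second voice information and sending the second voice information to the second device so that the second device uses the voice assistant to determine a voice trigger event, the voice trigger event corresponding to the second voice information; and receiving target information fed back by the second device and executing the voice trigger event based on the target information.
According to a second aspect of the present disclosure, a voice information processing method applied to a second device is provided. The method includes: starting a voice assistant in response to a voice assistant wake-up instruction, wherein the voice assistant wake-up instruction is sent by a first device to the second device when the first device determines that first voice information contains a wake-up keyword; acquiring second voice information sent by the first device and using the voice assistant to determine a voice trigger event corresponding to the second voice information; and feeding back target information associated with executing the voice trigger event to the first device, so that the first device executes the voice trigger event based on the target information.
According to a third aspect of the present disclosure, a voice information processing apparatus applied to a first device is provided, including: a wake-up trigger module configured to acquire first voice information and, if the first voice information contains a wake-up keyword, send a voice assistant wake-up instruction to a second device so that the second device starts a voice assistant; a voice sending module configured to acquire second voice information and send the second voice information to the second device so that the second device uses the voice assistant to determine a voice trigger event, the voice trigger event corresponding to the second voice information; and an event execution module configured to receive target information fed back by the second device and execute the voice trigger event based on the target information.
According to a fourth aspect of the present disclosure, a voice information processing apparatus applied to a second device is provided, including: a voice assistant starting module configured to start a voice assistant in response to a voice assistant wake-up instruction, wherein the voice assistant wake-up instruction is sent by a first device to the second device when the first device determines that first voice information contains a wake-up keyword; an event determination module configured to acquire second voice information sent by the first device and use the voice assistant to determine a voice trigger event corresponding to the second voice information; and an information feedback module configured to feed back target information associated with executing the voice trigger event to the first device, so that the first device executes the voice trigger event based on the target information.
According to a fifth aspect of the present disclosure, a computer-readable storage medium is provided, on which a computer program is stored; when the program is executed by a processor, the above voice information processing method is implemented.
According to a sixth aspect of the present disclosure, an electronic device is provided, including a processor and a memory configured to store one or more programs which, when executed by the processor, cause the processor to implement the above voice information processing method.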
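FIG. 1 shows a schematic diagram of an exemplary system architecture to which the voice information processing solution of an embodiment of the present disclosure is applied;
FIG. 2 shows a schematic structural diagram of an electronic device suitable for implementing an embodiment of the present disclosure;
FIG. 3 schematically shows a flowchart of a voice information processing method applied to a first device according to an exemplary embodiment of the present disclosure;
FIG. 4 schematically shows a flowchart of a voice information processing method applied to a second device according to an exemplary embodiment of the present disclosure;
FIG. 5 schematically shows a device interaction diagram for implementing a voice information processing process according to an exemplary embodiment of the present disclosure;
FIG. 6 schematically shows an overall software and hardware architecture diagram of the voice information processing solution of an embodiment of the present disclosure;
FIG. 7 schematically shows a block diagram of a voice information processing apparatus applied to a first device according to an exemplary embodiment of the present disclosure;
FIG. 8 schematically shows a block diagram of a voice information processing apparatus applied to a first device according to another exemplary embodiment of the present disclosure;
FIG. 9 schematically shows a block diagram of a voice information processing apparatus applied to a first device according to yet another exemplary embodiment of the present disclosure;
FIG. 10 schematically shows a block diagram of a voice information processing apparatus applied to a second device according to an exemplary embodiment of the present disclosure;
FIG. 11 schematically shows a block diagram of a voice information processing apparatus applied to a second device according to another exemplary embodiment of the present disclosure.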
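Example embodiments will now be described more fully with reference to the accompanying drawings. However, the example embodiments can be implemented in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that the present disclosure will be thorough and complete and will fully convey the concept of the example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, many specific details are provided to give a thorough understanding of the embodiments of the present disclosure. However, those skilled in the art will recognize that the technical solutions of the present disclosure may be practiced with one or more of the specific details omitted, or with other methods, components, apparatuses, steps, and so on. In other cases, well-known technical solutions are not shown or described in detail, so as not to distract from and obscure aspects of the present disclosure.
In addition, the drawings are only schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the figures denote the same or similar parts, and their repeated description will therefore be omitted. Some of the block diagrams shown in the drawings are functional entities and do not necessarily correspond to physically or logically independent entities. These functional entities may be implemented in software, in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The flowcharts shown in the drawings are only exemplary and need not include all of the steps. For example, some steps can be decomposed and some steps can be combined or partially combined, so the actual execution order may change according to the actual situation. In addition, the terms "first" and "second" used below are only for the purpose of distinction and should not be taken as a limitation of the present disclosure.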
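FIG. 1 shows a schematic diagram of an exemplary system architecture to which the voice information processing solution of an embodiment of the present disclosure is applied.
As shown in FIG. 1, the system architecture may include a first device 1100 and a second device 1200. The first device 1100 may be a device that receives voice information and executes the event corresponding to the voice information. Specifically, the first device 1100 may be AR (Augmented Reality) glasses, VR (Virtual Reality) glasses, or MR (Mixed Reality) glasses; however, the first device 1100 may also be another wearable device with a display function, such as a smart helmet. The second device 1200 may be a device configured with a voice assistant. It uses the voice assistant to analyze the voice information received from the first device 1100, determines the event corresponding to the voice information, and feeds the related information back to the first device 1100 so that the first device 1100 executes the event corresponding to the voice information. Specifically, the second device 1200 may be a mobile phone, a tablet, a personal computer, or the like.
In the voice information processing process of the exemplary embodiments of the present disclosure, the first device 1100 first acquires first voice information and performs keyword recognition on it to determine whether the first voice information contains a wake-up keyword. When the first voice information contains the wake-up keyword, the first device 1100 sends a voice assistant wake-up instruction to the second device 1200. In response to the voice assistant wake-up instruction, the second device 1200 starts the voice assistant installed on the second device 1200.
Next, if the first device 1100 acquires second voice information, it can send the second voice information to the second device 1200. The second device 1200 can use the voice assistant to determine the voice trigger event corresponding to the second voice information and send target information associated with the voice trigger event to the first device 1100, and the first device 1100 can execute the voice trigger event based on the target information.
In some embodiments, the process by which the second device 1200 determines the voice trigger event may be implemented by the second device 1200 alone. That is, the second device 1200 uses the voice assistant to analyze the second voice information (including but not limited to operations such as voice recognition, semantic recognition, and voice synthesis) and determines the voice trigger event according to the analysis result.
In other embodiments, the architecture implementing the voice information processing process of the present disclosure may further include a server 1300. In this case, the second device 1200 may send the second voice information to the server 1300, and the server 1300 analyzes the second voice information and feeds the analysis result back to the second device 1200.
It should be noted that the present disclosure does not limit the type of voice trigger event. Taking AR glasses as the first device 1100 as an example, voice trigger events may include adjusting the volume of the AR glasses, checking the weather, answering a call, setting a schedule, recording the screen, taking a screenshot, opening or closing an album, opening or closing a specified application, shutting down, and so on.
In addition, the second device 1200 may send the user interface (UI) of the voice assistant and/or of the voice trigger event to the first device 1100 for display.
It should be noted that the user interface sent by the second device 1200 to the first device 1100 for display may differ from the user interface displayed by the second device 1200 itself. Specifically, the user interface sent by the second device 1200 to the first device 1100 may be a three-dimensional interface image rendered and generated by the second device 1200 or the server 1300, whereas the user interface displayed by the second device 1200 itself is usually a two-dimensional interface, and the layout and content of the two interfaces also differ.
In the example where the first device 1100 is AR glasses, a three-dimensional stereoscopic effect can be presented for the user to view.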
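FIG. 2 shows a schematic diagram of an electronic device suitable for implementing the exemplary embodiments of the present disclosure. The first device and/or the second device in the exemplary embodiments of the present disclosure may be configured in the form shown in FIG. 2. It should be noted that the electronic device shown in FIG. 2 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
The electronic device of the present disclosure includes at least a processor and a memory. The memory is configured to store one or more programs which, when executed by the processor, enable the processor to implement the voice information processing method of the exemplary embodiments of the present disclosure.
Specifically, as shown in FIG. 2, the electronic device 200 may include: a processor 210, an internal memory 221, an external memory interface 222, a Universal Serial Bus (USB) interface 230, a charging management module 240, a power management module 241, a battery 242, an antenna 1, an antenna 2, a mobile communication module 250, a wireless communication module 260, an audio module 270, a speaker 271, a receiver 272, a microphone 273, a headphone jack 274, a sensor module 280, a display screen 290, a camera module 291, an indicator 292, a motor 293, buttons 294, a Subscriber Identification Module (SIM) card interface 295, and so on. The sensor module 280 may include a depth sensor, a pressure sensor, a gyroscope sensor, an air pressure sensor, a magnetic sensor, an acceleration sensor, a distance sensor, a proximity light sensor, a fingerprint sensor, a temperature sensor, a touch sensor, an ambient light sensor, a bone conduction sensor, and so on.
It can be understood that the structure illustrated in the embodiments of the present application does not constitute a specific limitation on the electronic device 200. In other embodiments of the present application, the electronic device 200 may include more or fewer components than shown, or combine some components, or split some components, or arrange the components differently. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The processor 210 may include one or more processing units. For example, the processor 210 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU). The different processing units may be separate components or may be integrated into one or more processors. In addition, a memory may be provided in the processor 210 to store instructions and data.
The USB interface 230 is an interface that conforms to the USB standard specification, and may specifically be a Mini USB interface, a Micro USB interface, a USB Type-C interface, or the like. The USB interface 230 can be used to connect a charger to charge the electronic device 200, and can also be used to transfer data between the electronic device 200 and peripheral devices. It can also be used to connect headphones and play audio through them. The interface can further be used to connect other electronic devices, such as AR devices.
The charging management module 240 is used to receive charging input from a charger, which may be a wireless charger or a wired charger. The power management module 241 is used to connect the battery 242, the charging management module 240, and the processor 210. The power management module 241 receives input from the battery 242 and/or the charging management module 240 and supplies power to the processor 210, the internal memory 221, the display screen 290, the camera module 291, the wireless communication module 260, and so on.
The wireless communication function of the electronic device 200 can be implemented by the antenna 1, the antenna 2, the mobile communication module 250, the wireless communication module 260, the modem processor, the baseband processor, and so on.
The mobile communication module 250 can provide wireless communication solutions applied to the electronic device 200, including 2G/3G/4G/5G.
The wireless communication module 260 can provide wireless communication solutions applied to the electronic device 200, including Wireless Local Area Networks (WLAN) (such as Wireless Fidelity (Wi-Fi) networks), Bluetooth (BT), the Global Navigation Satellite System (GNSS), Frequency Modulation (FM), Near Field Communication (NFC), Infrared (IR), and so on.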
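The electronic device 200 implements the display function through the GPU, the display screen 290, the application processor, and so on. The GPU is a microprocessor for image processing and connects the display screen 290 and the application processor. The GPU performs mathematical and geometric calculations for graphics rendering. The processor 210 may include one or more GPUs that execute program instructions to generate or change display information.
The electronic device 200 can implement the shooting function through the ISP, the camera module 291, the video codec, the GPU, the display screen 290, the application processor, and so on. In some embodiments, the electronic device 200 may include 1 or N camera modules 291, where N is a positive integer greater than 1; if the electronic device 200 includes N cameras, one of the N cameras is the main camera.
The internal memory 221 may be used to store computer-executable program code, which includes instructions. The internal memory 221 may include a program storage area and a data storage area. The external memory interface 222 can be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the electronic device 200.
The electronic device 200 can implement audio functions, such as music playback and recording, through the audio module 270, the speaker 271, the receiver 272, the microphone 273, the headphone jack 274, the application processor, and so on.
The audio module 270 is used to convert digital audio information into an analog audio signal for output, and also to convert analog audio input into a digital audio signal. The audio module 270 can also be used to encode and decode audio signals. In some embodiments, the audio module 270 may be provided in the processor 210, or some functional modules of the audio module 270 may be provided in the processor 210.
The speaker 271, also called a "loudspeaker", is used to convert audio electrical signals into sound signals. The electronic device 200 can play music or hands-free calls through the speaker 271. The receiver 272, also called an "earpiece", is used to convert audio electrical signals into sound signals. When the electronic device 200 answers a call or a voice message, the voice can be heard by bringing the receiver 272 close to the ear. The microphone 273 is used to convert sound signals into electrical signals. When making a call or sending a voice message, the user can speak with the mouth close to the microphone 273 to input the sound signal into the microphone 273. The electronic device 200 may be provided with at least one microphone 273. The headphone jack 274 is used to connect wired headphones.
Regarding the sensors that the sensor module 280 of the electronic device 200 may include: the depth sensor is used to obtain depth information of a scene. The pressure sensor is used to sense pressure signals and can convert them into electrical signals. The gyroscope sensor can be used to determine the motion posture of the electronic device 200. The air pressure sensor is used to measure air pressure. The magnetic sensor includes a Hall sensor, and the electronic device 200 can use the magnetic sensor to detect the opening and closing of a flip cover. The acceleration sensor can detect the magnitude of the acceleration of the electronic device 200 in each direction (generally three axes). The distance sensor is used to measure distance. The proximity light sensor may include, for example, a light-emitting diode (LED) and a light detector such as a photodiode. The fingerprint sensor is used to collect fingerprints. The temperature sensor is used to detect temperature. The touch sensor can pass a detected touch operation to the application processor to determine the type of touch event, and visual output related to the touch operation can be provided through the display screen 290. The ambient light sensor is used to sense ambient light brightness. The bone conduction sensor can acquire vibration signals.
The buttons 294 include a power button, volume buttons, and so on. The buttons 294 may be mechanical buttons or touch buttons. The motor 293 can generate vibration prompts, and can be used for incoming-call vibration prompts as well as touch vibration feedback. The indicator 292 may be an indicator light, which can be used to indicate the charging status and battery level changes, and can also be used to indicate messages, missed calls, notifications, and so on. The SIM card interface 295 is used to connect a SIM card. The electronic device 200 interacts with the network through the SIM card to implement functions such as calls and data communication.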
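The present application also provides a computer-readable storage medium. The computer-readable storage medium may be included in the electronic device described in the above embodiments, or it may exist separately without being assembled into the electronic device.
The computer-readable storage medium may be, for example but without limitation, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program, and the program may be used by or in combination with an instruction execution system, apparatus, or device.
The computer-readable storage medium can send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device. The program code contained on the computer-readable storage medium can be transmitted by any suitable medium, including but not limited to: wireless, wire, optical cable, RF, and so on, or any suitable combination of the above.
The computer-readable storage medium carries one or more programs which, when executed by the electronic device, cause the electronic device to implement the methods described in the following embodiments.
The flowcharts and block diagrams in the drawings illustrate the possible architectures, functions, and operations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a part of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions marked in the blocks may occur in an order different from that marked in the drawings. For example, two blocks shown in succession may actually be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams or flowcharts, and combinations of such blocks, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented in software or in hardware, and the described units may also be provided in a processor. In some cases, the names of these units do not constitute a limitation on the units themselves.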
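FIG. 3 schematically shows a flowchart of a voice information processing method applied to a first device according to an exemplary embodiment of the present disclosure. Referring to FIG. 3, the voice information processing method applied to the first device may include the following steps:
S32. Acquire first voice information and, if the first voice information contains a wake-up keyword, send a voice assistant wake-up instruction to the second device so that the second device starts the voice assistant.
In the exemplary embodiments of the present disclosure, the first device is equipped with a voice recording device. Specifically, the first device may be equipped with a single microphone or a microphone array to acquire voice information around the first device.
In addition, the first device is also equipped with a DSP (Digital Signal Processing) chip, which is configured to analyze the first voice information when it is acquired, so as to determine whether the first voice information contains a wake-up keyword. The wake-up keyword is used to wake up the voice assistant on the second device.
It is easy to understand that the first device is usually equipped with a sound card. In some embodiments, both the voice recording device and the DSP chip of the first device can be connected to the sound card, and the first voice information acquired by the voice recording device is sent to the DSP chip through the sound card. In this case, however, the sound card must be activated before the DSP chip can acquire the first voice information.
In other embodiments, the voice recording device of the first device can be connected directly to the DSP chip, so that the DSP chip can acquire the first voice information without relying on the sound card being activated.
For step S32, the voice recording device of the first device can acquire the first voice information, and the DSP chip can analyze the first voice information to determine whether it contains a wake-up keyword. When the first voice information contains the wake-up keyword, the first device can send a voice assistant wake-up instruction to the second device, and the second device can start the voice assistant in response to the voice assistant wake-up instruction.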
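The decision made in step S32 can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the message name `VOICE_ASSISTANT_WAKE`, the function names, and the token-list input are hypothetical, and a real DSP chip would perform acoustic keyword spotting on audio frames rather than compare text tokens.

```python
from typing import Callable, List

# Hypothetical name for the voice assistant wake-up instruction.
WAKE_INSTRUCTION = "VOICE_ASSISTANT_WAKE"


def contains_wake_keyword(recognized_tokens: List[str], wake_keyword: str) -> bool:
    """Stand-in for the DSP keyword check (assumption: input is already tokens)."""
    return wake_keyword in recognized_tokens


def handle_first_voice(
    recognized_tokens: List[str],
    wake_keyword: str,
    send_to_second_device: Callable[[str], None],
) -> bool:
    """Send the wake-up instruction only when the wake-up keyword is present."""
    if contains_wake_keyword(recognized_tokens, wake_keyword):
        send_to_second_device(WAKE_INSTRUCTION)
        return True
    return False
```

Passing the transport as a callable keeps the sketch independent of whether the devices are linked by USB, Bluetooth, or WiFi.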
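In some embodiments of the present disclosure, the first device and the second device are connected via USB. This wired connection avoids frame loss during information transmission and improves the accuracy of voice processing.
In other embodiments of the present disclosure, the first device may also be connected to the second device via Bluetooth or WiFi, which is not limited in the present disclosure.
It can be understood that, before the first device acquires first voice information containing the wake-up keyword, the voice assistant of the second device is in a closed state and the entire system is in a semi-sleep state, which helps to reduce the power consumption of the system.
In addition, after the second device starts the voice assistant, the second device can send the three-dimensional interface image of the voice assistant to the first device, so that the three-dimensional interface image of the voice assistant can be displayed on the display end of the first device. For example, the user can see the three-dimensional interface image of the voice assistant on the lens of the AR glasses, which reminds the user that the voice assistant has been turned on and that voice commands can be entered.
The three-dimensional interface image of the voice assistant may be rendered and generated by the second device. Specifically, it may be an image rendered in response to the voice assistant wake-up instruction, or an image that was rendered and stored in advance and is retrieved in response to the voice assistant wake-up instruction. The present disclosure does not limit the rendering method, time, etc.
S34. Acquire second voice information and send the second voice information to the second device, so that the second device uses the voice assistant to determine the voice trigger event.
When the second device has started the voice assistant, the microphone of the first device can acquire the second voice information and send it to the second device. It should be understood that, while the voice assistant of the second device is turned on, the first device can put the DSP chip used for wake-up keyword recognition into a dormant state.
When the first device and the second device are connected via USB, the second device can obtain the second voice information by way of USB redirection. That is, the voice recording device of the first device is redirected to be the voice input of the second device, and the voice assistant of the second device can listen to the voice recorded by the first device.
For the USB redirection process, specifically: first, a service in the second device that matches the first device can monitor the USB insertion event of the first device. It should be understood that this service responds only to the USB insertion event of the first device and not to USB insertion events of other devices. Next, the second device can reset its own voice recording device to the voice recording device of the first device through a callback function.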
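The two parts of the USB redirection described above (a service that reacts only to the matched device, and a callback that resets the recording device) can be sketched as follows. The class names, the `(vendor_id, product_id)` identifier, and the source labels are assumptions made for illustration; a real second device would use its operating system's USB and audio routing facilities instead.

```python
from typing import Callable, Tuple

# Hypothetical identifier: (vendor_id, product_id).
DeviceId = Tuple[int, int]


class AudioRouter:
    """Tracks which recording device currently feeds the voice assistant."""

    def __init__(self) -> None:
        self.input_source = "phone_builtin_mic"

    def redirect_to(self, source: str) -> None:
        self.input_source = source


class MatchedDeviceService:
    """Responds to the USB insertion event of one matched device only."""

    def __init__(self, matched: DeviceId, on_attached: Callable[[], None]) -> None:
        self.matched = matched
        self.on_attached = on_attached

    def handle_usb_inserted(self, device: DeviceId) -> bool:
        if device != self.matched:
            return False  # insertion events of other devices are ignored
        self.on_attached()  # callback resets the voice recording device
        return True
```

Keeping the filter and the routing separate means the same service could also trigger other actions on insertion, such as the security verification mentioned for the AR service.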
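In the example where the first device is AR glasses and the second device is a mobile phone: first, the AR service of the mobile phone can monitor the USB insertion event of the AR glasses and can also provide security verification; next, the mobile phone can reset its voice recording device to the microphone of the AR glasses; then, the user records voice information directly through the microphone of the AR glasses, and the voice information is transmitted directly to the voice recording device of the mobile phone and passed on to the upper-layer application on the mobile phone that interacts with the AR glasses (that is, the voice assistant).
According to some embodiments of the present disclosure, the second device may use the voice assistant to analyze the second voice information (including but not limited to operations such as voice recognition, semantic recognition, and voice synthesis) to determine the voice trigger event corresponding to the second voice information. For example, the analyzed voice trigger event is "turn down the volume"; as another example, it is "shut down"; as yet another example, it is "open album", and so on.
These embodiments can be implemented when the second device is not connected to the Internet (that is, not connected to the server).
According to other embodiments of the present disclosure, the second device may use the voice assistant to send the second voice information to the server; the server analyzes the second voice information, determines the voice trigger event, and feeds the voice trigger event back to the second device. In addition, the server can dynamically scale its service according to actual business needs.
In one embodiment, the server may determine the voice trigger event based on a first voice processing process. The first voice processing process may include: first, the server performs voice recognition on the second voice information to convert it into text information; next, the server may perform semantic recognition on the text information to determine the semantics corresponding to the text information and generate a semantic recognition result, where the semantic recognition result can be presented in the form of text; then, the server determines the voice trigger event corresponding to the second voice information according to the semantic recognition result. In this case, the voice trigger event may be, for example, an event for controlling the first device, such as shutting down the first device.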
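The three stages of the first voice processing process (voice recognition, semantic recognition, event determination) can be sketched as a small pipeline. Everything here is a stand-in chosen for illustration: the recognizer simply decodes UTF-8 bytes, the semantic step uses toy keyword rules, and the intent labels and table are hypothetical; the disclosure does not specify the models used.

```python
from typing import Dict, Optional

# Hypothetical table: semantic recognition result -> voice trigger event.
INTENT_TO_EVENT: Dict[str, str] = {
    "volume_down": "turn down the volume",
    "power_off": "shut down",
    "open_album": "open album",
}


def speech_to_text(second_voice_info: bytes) -> str:
    """Placeholder for voice recognition (assumes the bytes are UTF-8 text)."""
    return second_voice_info.decode("utf-8").strip().lower()


def semantic_recognition(text: str) -> Optional[str]:
    """Toy keyword rules standing in for a semantic recognition model."""
    if "volume" in text and ("down" in text or "lower" in text):
        return "volume_down"
    if "shut down" in text or "power off" in text:
        return "power_off"
    if "album" in text:
        return "open_album"
    return None


def determine_trigger_event(second_voice_info: bytes) -> Optional[str]:
    """Run the stages in order and map the result to a voice trigger event."""
    text = speech_to_text(second_voice_info)
    intent = semantic_recognition(text)
    return INTENT_TO_EVENT.get(intent) if intent else None
```

For example, `determine_trigger_event(b"Please turn the volume down")` yields the "turn down the volume" event, while an utterance matching no rule yields `None`.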
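In another embodiment, the server may determine the voice trigger event based on a second voice processing process. The second voice processing process may include: first, the server performs voice recognition on the second voice information to convert it into text information; next, the server may perform semantic recognition on the text information to obtain a voice recognition result; then, the server may determine the response text corresponding to the second voice information according to the voice recognition result and convert the response text into third voice information, which serves as the information contained in the voice trigger event. In this case, the second voice information may be information inquiring about the weather, the corresponding voice trigger event may be a voice broadcast of the weather conditions, the response text refers to real-time weather information determined in text form, and the third voice information is the voice information generated by text-to-speech conversion of the response text, such as "sunny" or "light rain", which can correspond to the target information fed back to the second device as described below.
It should be noted that, on the one hand, the above voice analysis process is only an example, and the present disclosure does not limit the specific process; on the other hand, the above voice analysis process can also be implemented on the second device, in which case the solution may not require the participation of the server.
S36. Receive target information fed back by the second device, and execute the voice trigger event based on the target information.
After the second device determines the voice trigger event corresponding to the second voice information, the second device can determine the target information associated with executing the voice trigger event and send the target information to the first device, so that the first device executes the voice trigger event based on the target information. The target information may be an instruction that controls the state of the first device, or information related to displaying the content corresponding to the voice trigger event.
Taking the voice trigger event "turn down the volume" as an example, the target information may be an instruction to lower the volume of the first device, and the first device may lower its own volume after receiving the instruction.
As another example, taking the voice trigger event "open album", the target information may include the information of each image in the album, so that after receiving the target information the first device can display the images contained in the album on its display end. For example, the photos contained in the album can be displayed on the lenses of the AR glasses.
As yet another example, taking the voice trigger event "shut down", the target information may be an instruction to shut down the first device, and the first device may shut down after receiving the instruction.
It can be understood that the voice assistant of the second device can integrate the mapping between voice trigger events and local system commands, so that once a voice trigger event is determined, these system commands can be used to control the first device.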
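The mapping from voice trigger events to target information, and the way the first device acts on it, can be sketched as below. The command strings, the `TargetInfo` structure, and the `FirstDevice` state are hypothetical names introduced for this example; the disclosure only states that target information is either a state-control instruction or display-related content.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class TargetInfo:
    kind: str     # "instruction" or "display_content"
    payload: Any  # a command name, or the content to display


# Hypothetical mapping from voice trigger events to local system commands.
EVENT_TO_COMMAND: Dict[str, str] = {
    "turn down the volume": "VOLUME_DOWN",
    "shut down": "POWER_OFF",
}


def build_target_info(event: str, album: List[str]) -> TargetInfo:
    """Second device side: turn a voice trigger event into target information."""
    if event in EVENT_TO_COMMAND:
        return TargetInfo("instruction", EVENT_TO_COMMAND[event])
    if event == "open album":
        return TargetInfo("display_content", list(album))
    raise ValueError(f"unsupported voice trigger event: {event}")


class FirstDevice:
    """First device side: executes the event described by the target information."""

    def __init__(self) -> None:
        self.volume = 5
        self.powered_on = True
        self.displayed: List[str] = []

    def execute(self, info: TargetInfo) -> None:
        if info.kind == "display_content":
            self.displayed = info.payload
        elif info.payload == "VOLUME_DOWN":
            self.volume = max(0, self.volume - 1)
        elif info.payload == "POWER_OFF":
            self.powered_on = False
```

With this split, the first device never interprets speech; it only applies an instruction or shows content, which matches the stated goal of reducing its computational load.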
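In addition, after the second device determines the voice trigger event using the voice assistant, the exemplary embodiments of the present disclosure further include a solution of sending a three-dimensional interface image corresponding to the voice trigger event to the first device.
Specifically, the second device can determine the three-dimensional interface image corresponding to the voice trigger event and send it to the first device, so that the first device displays the three-dimensional interface image on its display end. For example, the user can see the three-dimensional interface image corresponding to "turn down the volume" on the lens of the AR glasses.
Similar to the three-dimensional interface image of the voice assistant described above, the three-dimensional interface image corresponding to the voice trigger event is rendered and generated by the second device, and the present disclosure does not limit the rendering method, time, etc. It can be understood that the three-dimensional interface image described in the present disclosure may include one or a combination of text, symbols, static pictures, dynamic pictures, and videos.
Considering that power consumption may increase if the user does not speak for a long time, in some embodiments of the present disclosure the first device starts timing after it sends the voice assistant wake-up instruction to the second device. If the second voice information has not been acquired after a predetermined period of time (for example, 20 seconds), the first device sends a voice assistant shutdown instruction to the second device, and the second device can close the voice assistant in response to the voice assistant shutdown instruction.
Alternatively, the second device may do the timing itself: if the second voice information has not been acquired after the predetermined period of time, the second device closes the voice assistant by itself.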
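The timeout rule can be sketched with a small timer object that works the same way whether it runs on the first or the second device. The class and method names are assumptions for illustration, and the current time is passed in explicitly so that the logic does not depend on a particular clock source.

```python
from typing import Optional


class AssistantIdleTimer:
    """Signals that the assistant should close if no second voice information arrives in time."""

    def __init__(self, timeout_s: float = 20.0) -> None:
        self.timeout_s = timeout_s
        self.started_at: Optional[float] = None

    def on_wake_sent(self, now: float) -> None:
        """Start timing when the voice assistant wake-up instruction is sent."""
        self.started_at = now

    def on_second_voice(self) -> None:
        """Second voice information arrived, so stop timing."""
        self.started_at = None

    def should_close(self, now: float) -> bool:
        """True once the predetermined period has elapsed without second voice information."""
        return self.started_at is not None and now - self.started_at >= self.timeout_s
```

In use, the owning device would poll `should_close` (for example with `time.monotonic()` as the clock) and send or act on the voice assistant shutdown instruction when it returns true.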
本公开示例性实施方式还提供了一种应用于第二设备的语音信息处理方法的流程图。参考图4,应用于第二设备的语音信息处理方法可以包括以下步骤:
S42.响应语音助手唤醒指令,启动语音助手;其中,语音助手唤醒指令由第一设备在确定出第一语音信息包含唤醒关键词的情况下发送给第二设备;
S44.获取第一设备发送的第二语音信息,利用语音助手确定与第二语音信息对应的语音触发事件;
S46.将与执行语音触发事件相关联的目标信息反馈给第一设备,以便第一设备基于目标信息执行语音触发事件。
步骤S42至步骤S46的具体过程已在上述步骤S32至步骤S36中进行了详细说明,在此不再赘述。
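The device interaction diagram for the voice processing procedure of one embodiment of the present disclosure is described below with reference to FIG. 5.
In step S502, the first device acquires the first voice information and uses its onboard DSP chip to determine whether the first voice information contains a wake-up keyword, where the wake-up keyword may be a user-defined keyword, for example, "小布" (Xiaobu). In step S504, if the first voice information contains the wake-up keyword, the first device may send a voice assistant wake-up instruction to the second device.
In step S506, the second device starts the voice assistant. In addition, in step S508, the second device may send the three-dimensional interface image of the voice assistant to the first device. In step S510, the first device may display the three-dimensional interface image of the voice assistant on its display. For example, the image may be displayed on the lenses of AR glasses.
Steps S502 to S510 exemplify how the wake-up service of the first device starts the voice assistant of the second device. Steps S512 to S528 below describe how the first device is controlled by voice.
In step S512, the first device acquires the second voice information and sends it to the second device.
In step S514, the second device sends the second voice information to the server. In step S516, the server analyzes the second voice information and obtains an analysis result. In step S518, the server feeds the analysis result of the second voice information back to the second device. In step S520, the second device determines the voice trigger event according to the analysis result from the server.
In step S522, the second device sends the target information to the first device. The target information is associated with executing the voice trigger event; that is, it is the information the first device needs in order to execute the voice trigger event. In step S524, the first device executes the voice trigger event.
In addition, in step S526, the second device sends the three-dimensional interface image corresponding to the voice trigger event to the first device. In step S528, the first device displays the three-dimensional interface image corresponding to the voice trigger event.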
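The order of steps S502 to S528 can be written out as a trace that a test harness might check. The sketch below is an assumption-laden illustration: `run_interaction` is a hypothetical function, the utterances are plain strings, the wake-up keyword is the romanized example "xiaobu", and the server analysis is an injected stub.

```python
def run_interaction(first_utterance, second_utterance, analyze, wake_keyword="xiaobu"):
    """Return ordered (step, actor, action) tuples for one voice interaction."""
    steps = [("S502", "first device", "check first voice info for wake-up keyword")]
    if wake_keyword not in first_utterance.lower():
        return steps  # no keyword: nothing is sent to the second device
    steps += [
        ("S504", "first device", "send voice assistant wake-up instruction"),
        ("S506", "second device", "start voice assistant"),
        ("S508", "second device", "send assistant 3D interface image"),
        ("S510", "first device", "display assistant 3D interface image"),
        ("S512", "first device", "send second voice info"),
        ("S514", "second device", "forward second voice info to server"),
    ]
    event = analyze(second_utterance)  # stub for the server-side analysis
    steps += [
        ("S516", "server", "analyze second voice info"),
        ("S518", "server", "return analysis result"),
        ("S520", "second device", f"determine voice trigger event: {event}"),
        ("S522", "second device", "send target info"),
        ("S524", "first device", f"execute voice trigger event: {event}"),
        ("S526", "second device", "send event 3D interface image"),
        ("S528", "first device", "display event 3D interface image"),
    ]
    return steps
```

Calling `run_interaction("Xiaobu", "volume down", lambda s: "volume_down")` returns fourteen tuples in the order of FIG. 5, while an utterance without the keyword stops after S502.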
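The overall software and hardware architecture of the voice information processing solution of an embodiment of the present disclosure is described below with reference to FIG. 6. In this embodiment, the first device is a pair of AR glasses and the second device is a mobile phone. The AR glasses and the mobile phone are connected via USB, and the mobile phone and the server may be connected via 3G, 4G, 5G, WiFi, and the like.
For the AR glasses, the operating system may be an RTOS (real-time operating system), and the AR glasses are equipped with a DSP chip that provides the wake-up service. When the microphone of the AR glasses acquires the first voice information, a keyword recognition engine is used to determine whether the first voice information contains the wake-up keyword. In addition, when the voice assistant of the mobile phone has been woken up, the AR glasses may present the three-dimensional user interface of the voice assistant rendered by the mobile phone.
For the mobile phone, an AR software platform (for example, ARCore or ARKit) may be configured in the operating system. On top of the AR software platform, a voice assistant application (APP) may be configured, and the voice assistant APP may be started in response to the instruction generated by the wake-up service of the AR glasses. Through UI interaction with the voice assistant user interface, the microphone of the AR glasses can be reset; this user interface also corresponds to the voice assistant user interface displayed on the AR glasses.
In addition, the voice software development kit (SDK) provided by the voice assistant APP in the mobile phone enables interaction with the voice and semantic analysis engine of the server, so that the second voice information is sent to the server, analyzed by the voice and semantic analysis engine of the server, and the analysis result is fed back.
It should be noted that the architecture of FIG. 6 is only an example and should not be taken as a limitation of the present disclosure.
It should be noted that although the steps of the method in the present disclosure are described in a specific order in the drawings, this does not require or imply that the steps must be performed in that specific order, or that all of the steps shown must be performed to achieve the desired result. Additionally or alternatively, some steps may be omitted, multiple steps may be combined into one step, and/or one step may be split into multiple steps.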
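Further, this exemplary embodiment also provides a voice information processing apparatus applied to the first device.
FIG. 7 schematically shows a block diagram of a voice information processing apparatus applied to the first device according to an exemplary embodiment of the present disclosure. Referring to FIG. 7, the voice information processing apparatus 7 applied to the first device according to an exemplary embodiment of the present disclosure may include a wake-up trigger module 71, a voice sending module 73, and an event execution module 75.
Specifically, the wake-up trigger module 71 may be configured to acquire first voice information and, if the first voice information contains a wake-up keyword, send a voice assistant wake-up instruction to the second device so that the second device starts the voice assistant; the voice sending module 73 may be configured to acquire second voice information and send the second voice information to the second device, so that the second device uses the voice assistant to determine the voice trigger event corresponding to the second voice information; the event execution module 75 may be configured to receive the target information that is fed back by the second device and associated with executing the voice trigger event, and to execute the voice trigger event based on the target information.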
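One way to picture modules 71, 73 and 75 is as three methods of a single object on the first device. The sketch below is only an illustration of that decomposition: the class `FirstDeviceApparatus`, the message kinds `"wake_assistant"` and `"second_voice"`, and the injected callables are hypothetical, and keyword spotting is simplified to a substring check on text.

```python
class FirstDeviceApparatus:
    """Hypothetical grouping of modules 71, 73 and 75 on the first device."""

    def __init__(self, send_to_second_device, execute, wake_keyword="xiaobu"):
        self._send = send_to_second_device  # callable(kind, payload)
        self._execute = execute             # callable(target_info)
        self._wake_keyword = wake_keyword

    def wake_trigger(self, first_voice_text):
        """Module 71: send a wake-up instruction if the keyword is present."""
        if self._wake_keyword in first_voice_text.lower():
            self._send("wake_assistant", None)
            return True
        return False

    def voice_send(self, second_voice):
        """Module 73: forward the second voice information unchanged."""
        self._send("second_voice", second_voice)

    def event_execute(self, target_info):
        """Module 75: act on the target information fed back by the second device."""
        self._execute(target_info)
```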
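According to an exemplary embodiment of the present disclosure, referring to FIG. 8, compared with the voice information processing apparatus 7, the voice information processing apparatus 8 may further include an image display module 81.
Specifically, the image display module 81 may be configured to: after the second device starts the voice assistant, receive the three-dimensional interface image of the voice assistant, where the three-dimensional interface image of the voice assistant is rendered by the second device; and display the three-dimensional interface image of the voice assistant on the display of the first device.
According to an exemplary embodiment of the present disclosure, the first device and the second device are connected via USB. In this case, the voice sending module 73 may be configured to: acquire the second voice information; and send the second voice information to the second device by means of USB redirection.
According to an exemplary embodiment of the present disclosure, the image display module 81 may be further configured to: receive the three-dimensional interface image corresponding to the voice trigger event, where the three-dimensional interface image corresponding to the voice trigger event is rendered by the second device; and display the three-dimensional interface image corresponding to the voice trigger event on the display of the first device.
According to an exemplary embodiment of the present disclosure, referring to FIG. 9, compared with the voice information processing apparatus 7, the voice information processing apparatus 9 may further include a close trigger module 91.
Specifically, the close trigger module 91 may be configured to: start timing after the voice assistant wake-up instruction is sent to the second device; and, if the second voice information has not been acquired after a predetermined period, send a voice assistant close instruction to the second device so that the second device closes the voice assistant.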
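Further, this exemplary embodiment also provides a voice information processing apparatus applied to the second device.
FIG. 10 schematically shows a block diagram of a voice information processing apparatus applied to the second device according to an exemplary embodiment of the present disclosure. Referring to FIG. 10, the voice information processing apparatus 10 applied to the second device according to an exemplary embodiment of the present disclosure may include a voice assistant starting module 101, an event determination module 103, and an information feedback module 105.
Specifically, the voice assistant starting module 101 may be configured to start the voice assistant in response to a voice assistant wake-up instruction, where the voice assistant wake-up instruction is sent to the second device by the first device when the first device determines that the first voice information contains a wake-up keyword; the event determination module 103 may be configured to acquire the second voice information sent by the first device and use the voice assistant to determine the voice trigger event corresponding to the second voice information; the information feedback module 105 may be configured to feed the target information associated with executing the voice trigger event back to the first device, so that the first device executes the voice trigger event based on the target information.
According to an exemplary embodiment of the present disclosure, the information feedback module 105 may be further configured to: after the voice assistant is started, send the three-dimensional interface image of the voice assistant to the first device, so that the first device displays the three-dimensional interface image of the voice assistant on its display, where the three-dimensional interface image of the voice assistant is rendered by the second device.
According to an exemplary embodiment of the present disclosure, the first device and the second device are connected via USB. In this case, to acquire the second voice information sent by the first device, the event determination module 103 may be configured to: acquire the second voice information from the first device by means of USB redirection.
According to an exemplary embodiment of the present disclosure, the event determination module 103 may be further configured to: use the voice assistant to send the second voice information to the server, so that the server determines the voice trigger event corresponding to the second voice information; and acquire the voice trigger event, determined by the server, that corresponds to the second voice information.
According to an exemplary embodiment of the present disclosure, the information feedback module 105 may be further configured to: after the voice assistant is used to determine the voice trigger event corresponding to the second voice information, send the three-dimensional interface image corresponding to the voice trigger event to the first device, so that the first device displays that image on its display, where the three-dimensional interface image corresponding to the voice trigger event is rendered by the second device.
According to an exemplary embodiment of the present disclosure, referring to FIG. 11, compared with the voice information processing apparatus 10, the voice information processing apparatus 11 may further include a voice assistant closing module 111.
Specifically, the voice assistant closing module 111 may be configured to: close the voice assistant in response to a voice assistant close instruction.
Since the functional modules of the voice information processing apparatus of the embodiments of the present disclosure correspond to those in the method embodiments above, they are not described again here.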
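From the description of the embodiments above, those skilled in the art will readily understand that the exemplary embodiments described here may be implemented in software, or in software combined with the necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product. The software product may be stored on a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and includes several instructions that cause a computing device (which may be a personal computer, a server, a terminal apparatus, a network device, etc.) to perform the method according to the embodiments of the present disclosure.
In addition, the drawings above are merely schematic illustrations of the processing included in the methods according to exemplary embodiments of the present disclosure and are not intended to be limiting. It is easy to understand that the processing shown in the drawings does not indicate or limit the chronological order of these processes. It is also easy to understand that these processes may be performed, for example, synchronously or asynchronously in multiple modules.
It should be noted that although several modules or units of the device for performing actions are mentioned in the detailed description above, this division is not mandatory. In fact, according to the embodiments of the present disclosure, the features and functions of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of one module or unit described above may be further divided and embodied by multiple modules or units.
Other embodiments of the present disclosure will readily occur to those skilled in the art after considering the specification and practicing what is disclosed here. This application is intended to cover any variations, uses, or adaptations of the present disclosure that follow its general principles and include common general knowledge or customary technical means in the art that are not disclosed in the present disclosure. The specification and embodiments are to be regarded as exemplary only, and the true scope and spirit of the present disclosure are indicated by the claims.
It should be understood that the present disclosure is not limited to the precise structures described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present disclosure is limited only by the appended claims.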
Claims (20)
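- A voice information processing method, applied to a first device, comprising: acquiring first voice information, and if the first voice information contains a wake-up keyword, sending a voice assistant wake-up instruction to a second device so that the second device starts a voice assistant; acquiring second voice information, and sending the second voice information to the second device so that the second device uses the voice assistant to determine a voice trigger event, the voice trigger event corresponding to the second voice information; and receiving target information fed back by the second device, and executing the voice trigger event based on the target information.
- The voice information processing method according to claim 1, wherein, after the second device starts the voice assistant, the voice information processing method further comprises: receiving a three-dimensional interface image of the voice assistant, wherein the three-dimensional interface image of the voice assistant is rendered by the second device; and displaying the three-dimensional interface image of the voice assistant on a display of the first device.
- The voice information processing method according to claim 1, wherein the first device and the second device are connected via USB; and wherein acquiring the second voice information and sending the second voice information to the second device comprises: acquiring the second voice information; and sending the second voice information to the second device by means of USB redirection.
- The voice information processing method according to claim 1 or 2, wherein the voice information processing method further comprises: receiving a three-dimensional interface image corresponding to the voice trigger event, wherein the three-dimensional interface image corresponding to the voice trigger event is rendered by the second device; and displaying the three-dimensional interface image corresponding to the voice trigger event on the display of the first device.
- The voice information processing method according to any one of claims 1 to 3, wherein the voice information processing method further comprises: starting timing after the voice assistant wake-up instruction is sent to the second device; and, if the second voice information has not been acquired after a predetermined period, sending a voice assistant close instruction to the second device so that the second device closes the voice assistant.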
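- A voice information processing method, applied to a second device, comprising: starting a voice assistant in response to a voice assistant wake-up instruction, wherein the voice assistant wake-up instruction is sent to the second device by a first device when the first device determines that first voice information contains a wake-up keyword; acquiring second voice information sent by the first device, and using the voice assistant to determine a voice trigger event corresponding to the second voice information; and feeding target information associated with executing the voice trigger event back to the first device, so that the first device executes the voice trigger event based on the target information.
- The voice information processing method according to claim 6, wherein, after the voice assistant is started, the voice information processing method further comprises: sending a three-dimensional interface image of the voice assistant to the first device, so that the first device displays the three-dimensional interface image of the voice assistant on its display, wherein the three-dimensional interface image of the voice assistant is rendered by the second device.
- The voice information processing method according to claim 6, wherein the first device and the second device are connected via USB; and wherein acquiring the second voice information sent by the first device comprises: acquiring the second voice information from the first device by means of USB redirection.
- The voice information processing method according to claim 6, wherein using the voice assistant to determine the voice trigger event corresponding to the second voice information comprises: using the voice assistant to send the second voice information to a server, so that the server determines the voice trigger event corresponding to the second voice information; and acquiring the voice trigger event determined by the server, the voice trigger event corresponding to the second voice information.
- The voice information processing method according to claim 9, wherein the voice trigger event is determined by the server performing a first voice processing procedure, and the first voice processing procedure comprises: converting the second voice information into text information; and performing semantic recognition on the text information, and determining, according to a semantic recognition result, the voice trigger event corresponding to the second voice information.
- The voice information processing method according to claim 9, wherein the voice trigger event is determined by the server performing a second voice processing procedure, and the second voice processing procedure comprises: converting the second voice information into text information; performing semantic recognition on the text information to obtain a semantic recognition result; determining, according to the semantic recognition result, a response text corresponding to the second voice information; and converting the response text into third voice information as the information contained in the voice trigger event.
- The voice information processing method according to claim 6 or 7, wherein, after the voice assistant is used to determine the voice trigger event corresponding to the second voice information, the voice information processing method further comprises: sending a three-dimensional interface image corresponding to the voice trigger event to the first device, so that the first device displays the three-dimensional interface image corresponding to the voice trigger event on its display, wherein the three-dimensional interface image corresponding to the voice trigger event is rendered by the second device.
- The voice information processing method according to any one of claims 6 to 11, wherein the voice information processing method further comprises: closing the voice assistant in response to a voice assistant close instruction.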
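- A voice information processing apparatus, applied to a first device, comprising: a wake-up trigger module configured to acquire first voice information and, if the first voice information contains a wake-up keyword, send a voice assistant wake-up instruction to a second device so that the second device starts a voice assistant; a voice sending module configured to acquire second voice information and send the second voice information to the second device, so that the second device uses the voice assistant to determine a voice trigger event, the voice trigger event corresponding to the second voice information; and an event execution module configured to receive target information fed back by the second device and execute the voice trigger event based on the target information.
- The voice information processing apparatus according to claim 14, wherein the voice information processing apparatus further comprises: an image display module configured to receive a three-dimensional interface image of the voice assistant after the second device starts the voice assistant, wherein the three-dimensional interface image of the voice assistant is rendered by the second device; and to display the three-dimensional interface image of the voice assistant on a display of the first device.
- The voice information processing apparatus according to claim 14, wherein the first device and the second device are connected via USB; and the voice sending module is further configured to acquire the second voice information and send the second voice information to the second device by means of USB redirection.
- A voice information processing apparatus, applied to a second device, comprising: a voice assistant starting module configured to start a voice assistant in response to a voice assistant wake-up instruction, wherein the voice assistant wake-up instruction is sent to the second device by a first device when the first device determines that first voice information contains a wake-up keyword; an event determination module configured to acquire second voice information sent by the first device and use the voice assistant to determine a voice trigger event corresponding to the second voice information; and an information feedback module configured to feed target information associated with executing the voice trigger event back to the first device, so that the first device executes the voice trigger event based on the target information.
- The voice information processing apparatus according to claim 17, wherein the information feedback module is further configured to send a three-dimensional interface image of the voice assistant to the first device after the voice assistant is started, so that the first device displays the three-dimensional interface image of the voice assistant on its display, the three-dimensional interface image of the voice assistant being rendered by the second device.
- A computer-readable storage medium storing a computer program which, when executed by a processor, implements the voice information processing method according to any one of claims 1 to 13.
- An electronic device, comprising: a processor; and a memory configured to store one or more programs which, when executed by the processor, cause the processor to implement the voice information processing method according to any one of claims 1 to 13.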
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| EP21807986.1A EP4123444A4 (en) | 2020-05-18 | 2021-03-17 | METHOD AND APPARATUS FOR PROCESSING VOICE INFORMATION, AND RECORDING MEDIUM AND ELECTRONIC DEVICE |
| US17/934,260 US12001758B2 (en) | 2020-05-18 | 2022-09-22 | Voice information processing method and electronic device |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010419583.7A CN111694605A (zh) | 2020-05-18 | 2020-05-18 | Voice information processing method and apparatus, storage medium, and electronic device |
| CN202010419583.7 | 2020-05-18 |
Related Child Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US17/934,260 Continuation US12001758B2 (en) | 2020-05-18 | 2022-09-22 | Voice information processing method and electronic device |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2021232913A1 (zh) | 2021-11-25 |
Family
ID=72477935
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2021/081332 Ceased WO2021232913A1 (zh) | 2020-05-18 | 2021-03-17 | Voice information processing method and apparatus, storage medium, and electronic device |
Country Status (4)
| Country | Link |
|---|---|
| US (1) | US12001758B2 (zh) |
| EP (1) | EP4123444A4 (zh) |
| CN (1) | CN111694605A (zh) |
| WO (1) | WO2021232913A1 (zh) |
2020
- 2020-05-18 CN CN202010419583.7A patent/CN111694605A/zh active Pending
2021
- 2021-03-17 WO PCT/CN2021/081332 patent/WO2021232913A1/zh not_active Ceased
- 2021-03-17 EP EP21807986.1A patent/EP4123444A4/en not_active Withdrawn
2022
- 2022-09-22 US US17/934,260 patent/US12001758B2/en active Active
Also Published As
| Publication number | Publication date |
|---|---|
| US12001758B2 (en) | 2024-06-04 |
| EP4123444A4 (en) | 2023-11-15 |
| EP4123444A1 (en) | 2023-01-25 |
| US20230010969A1 (en) | 2023-01-12 |
| CN111694605A (zh) | 2020-09-22 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 21807986; Country of ref document: EP; Kind code of ref document: A1 |
| | ENP | Entry into the national phase | Ref document number: 2021807986; Country of ref document: EP; Effective date: 20221017 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | WWW | Wipo information: withdrawn in national office | Ref document number: 2021807986; Country of ref document: EP |