WO2023010861A1 - Wake-up processing method, apparatus, device, and computer storage medium - Google Patents

Info

Publication number
WO2023010861A1
WO2023010861A1 PCT/CN2022/082571 CN2022082571W WO2023010861A1 WO 2023010861 A1 WO2023010861 A1 WO 2023010861A1 CN 2022082571 W CN2022082571 W CN 2022082571W WO 2023010861 A1 WO2023010861 A1 WO 2023010861A1
Authority
WO
WIPO (PCT)
Prior art keywords
wake
confidence
training data
confidence level
sets
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2022/082571
Other languages
English (en)
French (fr)
Inventor
陈柏仰
陈奕荣
霍伟明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
GD Midea Air Conditioning Equipment Co Ltd
Foshan Shunde Midea Electric Science and Technology Co Ltd
Original Assignee
GD Midea Air Conditioning Equipment Co Ltd
Foshan Shunde Midea Electric Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by GD Midea Air Conditioning Equipment Co Ltd, Foshan Shunde Midea Electric Science and Technology Co Ltd filed Critical GD Midea Air Conditioning Equipment Co Ltd
Priority to EP22851586.2A priority Critical patent/EP4383250A4/en
Priority to JP2024531560A priority patent/JP7743630B2/ja
Publication of WO2023010861A1 publication Critical patent/WO2023010861A1/zh
Priority to US18/431,630 priority patent/US20240177707A1/en
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L12/00 Data switching networks
    • H04L12/28 Data switching networks characterised by path configuration, e.g. LAN [Local Area Networks] or WAN [Wide Area Networks]
    • H04L12/2803 Home automation networks
    • H04L12/2816 Controlling appliance services of a home automation network by calling their functionalities
    • H04L12/282 Controlling appliance services of a home automation network by calling their functionalities based on user interaction within the home
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L2015/0638 Interactive procedures
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L2015/088 Word spotting
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Definitions

  • The present disclosure relates to the technical field of speech recognition, and in particular to a wake-up processing method, apparatus, device, and computer storage medium.
  • With voice recognition technology, home intelligence has become a trend, and voice devices have gradually entered people's daily lives.
  • Many users' homes contain several types of voice devices. Before a voice device can be controlled by voice, it must be woken up.
  • The present disclosure aims to provide a wake-up processing method, apparatus, device, and computer storage medium that can avoid the wake-up word crosstalk that arises when different wake-up words are trained at the same time, and reduce the false wake-up rate of voice devices.
  • an embodiment of the present disclosure provides a wake-up processing method applied to a voice device, and the method includes:
  • a wakeup event of the voice device is triggered.
  • the acquiring the audio to be recognized includes:
  • each set of training data includes model parameters and a confidence threshold; the processing of the audio to be recognized by using the wake-up model and at least two sets of training data respectively, to obtain at least two confidence levels and their respective corresponding confidence thresholds, includes:
  • the at least two sets of training data include a first set of training data and a second set of training data, the first set of training data includes first model parameters and a first confidence threshold, and the second set of training data includes second model parameters and a second confidence threshold;
  • the process of using the wake-up model and at least two sets of training data to process the audio to be identified respectively to obtain at least two confidence levels and corresponding confidence thresholds includes:
  • the triggering the wake-up event of the voice device according to the comparison results between the at least two confidence levels and their corresponding confidence thresholds includes:
  • if the first confidence level is greater than or equal to the first confidence threshold, or the second confidence level is greater than or equal to the second confidence threshold, a wake-up event of the voice device is triggered.
  • the wake-up event includes a first wake-up event and/or a second wake-up event; wherein the first wake-up event is associated with the wake-up word corresponding to the first set of training data, and the second wake-up event is associated with the wake-up word corresponding to the second set of training data.
  • the triggering the wake-up event of the voice device according to the comparison results between the at least two confidence levels and their corresponding confidence thresholds includes:
  • if the first confidence level is greater than or equal to the first confidence threshold and the second confidence level is less than the second confidence threshold, triggering the first wake-up event of the voice device; or,
  • if the second confidence level is greater than or equal to the second confidence threshold and the first confidence level is less than the first confidence threshold, triggering the second wake-up event of the voice device; or,
  • if the first confidence level is greater than or equal to the first confidence threshold and the second confidence level is greater than or equal to the second confidence threshold, calculating a first value by which the first confidence level exceeds the first confidence threshold and a second value by which the second confidence level exceeds the second confidence threshold, and triggering a target wake-up event of the voice device according to the first value and the second value.
  • the triggering a target wakeup event of the voice device according to the first value and the second value includes:
  • the target wake-up event is the first wake-up event and trigger it; or,
  • the method also includes:
  • the wake-up model is trained by using the at least two sets of wake-up word training sets to obtain the at least two sets of training data; wherein, each set of training data includes model parameters and confidence thresholds.
  • the acquiring the at least two groups of wake-up word training sets includes:
  • the initial training set is grouped according to different wake-up words to obtain the at least two groups of wake-up word training sets.
  • an embodiment of the present disclosure provides a wake-up processing device, which is applied to a voice device, and the wake-up processing device includes an acquisition unit, a processing unit, and a trigger unit; wherein,
  • the acquisition unit is configured to acquire audio to be identified
  • the processing unit is configured to use the wake-up model and at least two sets of training data to respectively process the audio to be recognized to obtain at least two confidence levels and corresponding confidence thresholds; wherein the at least two sets of training data are obtained by separately training the wake-up model on at least two wake-up word training sets;
  • the triggering unit is configured to trigger a wake-up event of the voice device according to a comparison result between the at least two confidence levels and respective corresponding confidence level thresholds.
  • an embodiment of the present disclosure provides a voice device, where the voice device includes a memory and a processor; wherein,
  • said memory for storing a computer program capable of running on said processor
  • the processor is configured to execute the method according to any one of the first aspect when running the computer program.
  • an embodiment of the present disclosure provides a computer storage medium, the computer storage medium stores a computer program, and when the computer program is executed by at least one processor, the method according to any one of the first aspect is implemented .
  • Embodiments of the present disclosure provide a wake-up processing method, apparatus, device, and computer storage medium. Audio to be recognized is acquired; the audio to be recognized is processed using a wake-up model and at least two sets of training data respectively, to obtain at least two confidence levels and their respective corresponding confidence thresholds, wherein the at least two sets of training data are obtained by separately training the wake-up model on at least two wake-up word training sets; and the device to be woken up is determined according to the comparison results between the at least two confidence levels and their respective corresponding confidence thresholds.
  • FIG. 1 is a schematic flowchart of a wake-up processing method provided by an embodiment of the present disclosure
  • FIG. 2 is a schematic flowchart of another wake-up processing method provided by an embodiment of the present disclosure
  • FIG. 3 is a schematic diagram of a training process of a wake-up model provided by an embodiment of the present disclosure
  • FIG. 4 is a detailed flowchart of a wake-up processing method provided by an embodiment of the present disclosure
  • FIG. 5 is a schematic structural diagram of a wake-up processing device provided by an embodiment of the present disclosure.
  • FIG. 6 is a schematic diagram of a specific hardware structure of a speech device provided by an embodiment of the present disclosure.
  • The terms "first/second/third" in the embodiments of the present disclosure are only used to distinguish similar objects and do not represent a specific ordering of the objects. Understandably, where permitted, the specific order or sequence of "first/second/third" may be interchanged, so that the embodiments of the disclosure described herein can be practiced in sequences other than those illustrated or described herein.
  • an embodiment of the present disclosure provides a wake-up processing method.
  • The basic idea of the method is: acquire the audio to be recognized; process the audio to be recognized using the wake-up model and at least two sets of training data respectively, to obtain at least two confidence levels and their respective corresponding confidence thresholds, wherein the at least two sets of training data are obtained by separately training the wake-up model on at least two wake-up word training sets; and determine the device to be woken up according to the comparison results between the at least two confidence levels and their respective corresponding confidence thresholds.
  • Referring to FIG. 1, it shows a schematic flowchart of a wake-up processing method provided by an embodiment of the present disclosure. As shown in FIG. 1, the method may include:
  • S101 Acquire audio to be recognized.
  • the wake-up processing method in the embodiment of the present disclosure is applied to a wake-up processing device, or a voice device integrated with the wake-up processing device.
  • the voice device can perform voice interaction with the user, and it is any device to be woken up that needs to be awakened by voice, such as any common household appliances such as voice air conditioner, voice water heater, voice rice cooker, and voice microwave oven, but is not limited in any way.
  • Since the voice device can perform voice interaction with the user, data can also be collected through a sound collection device. Therefore, in some embodiments, for step S101, the acquiring the audio to be recognized may include:
  • the sound collection device may be an audio collector such as a microphone.
  • the initial voice data from the user can be acquired through real-time data collection by the microphone; then the audio to be recognized can be obtained after preprocessing the initial voice data.
  • the initial voice data in the embodiment of the present disclosure includes the voice information of the user.
  • the initial voice data may be uttered by the user, for example, "Xiaomei Xiaomei"; after the voice device acquires the voice information, it preprocesses that information.
  • the preprocessing may include two aspects: an endpoint detection process and a pre-emphasis process, which will be described in detail below.
  • The endpoint detection process refers to finding the start point and end point of the instruction audio. Several consecutive frames of sound clips may be intercepted from the sound information, and the first few frames, taken in the order of the sound clips, are set as the audio to be recognized. The specific number of frames of the audio to be recognized can be determined according to the length of the configured wake-up word. For example, a specific duration is preset according to the number of words in the wake-up word, and the sound segment within that duration is determined as the audio to be recognized. The specific duration can be adjusted according to the actual situation, and this embodiment does not make any limitation.
  • The number of audio frames to be processed can also be determined by detecting the length of empty data between two consecutive sound clips. For example, in actual use, the user may call out the wake-up word first and then, after a pause of several seconds, call out the rest of the voice instruction; the segment before the empty data can then be used as the audio to be recognized.
  • For example, the voice air conditioner receives a piece of audio "Xiaomei Xiaomei" through the sound collection device, and the wake-up word "Xiaomei Xiaomei" has a preset duration of two seconds.
  • In that case, the frames corresponding to the first 2 seconds are intercepted as the audio to be processed.
  • As another example, the voice air conditioner receives a piece of audio "Xiaomei, Xiaomei, turn up the temperature" through the sound collection device, with a blank interval between the two phrases. The audio information during the blank interval is empty data, so the frames before the empty data can be taken as the audio to be processed.
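The two truncation strategies described above (a preset duration for the wake-up word, or cutting at the first run of empty data) can be sketched as follows. This is a minimal illustration rather than the disclosed implementation: the frame representation, the mean-energy test used to decide that a frame is "empty", and the function and parameter names are all assumptions.

```python
from typing import List, Sequence

Frame = Sequence[float]  # one frame of audio samples (assumed representation)


def frames_for_duration(frames: Sequence[Frame], frame_ms: int,
                        wake_word_ms: int) -> List[Frame]:
    """Keep the leading frames that cover the preset wake-up word duration."""
    count = max(1, wake_word_ms // frame_ms)
    return list(frames[:count])


def frames_before_gap(frames: Sequence[Frame], energy_threshold: float,
                      min_gap_frames: int) -> List[Frame]:
    """Keep the frames that precede the first run of low-energy frames.

    A frame counts as "empty data" when its mean squared amplitude is below
    energy_threshold (an assumed criterion). If no run of at least
    min_gap_frames empty frames is found, all frames are returned.
    """
    silent_run = 0
    for i, frame in enumerate(frames):
        energy = sum(s * s for s in frame) / max(1, len(frame))
        if energy < energy_threshold:
            silent_run += 1
            if silent_run >= min_gap_frames:
                # Cut at the start of the silent run.
                return list(frames[: i - silent_run + 1])
        else:
            silent_run = 0
    return list(frames)
```

With 1-second frames and a 2-second wake-up word, `frames_for_duration` keeps the first two frames, which mirrors the "first 2 seconds" example in the text.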
  • the pre-emphasis process refers to emphasizing the high-frequency part of the audio to increase the high-frequency resolution.
  • The environmental sound information and the audio information are extracted from the sound information by means of audio recognition; noise interference is removed and the high-frequency resolution is increased to obtain clear vocal information.
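Pre-emphasis is commonly implemented as a first-order high-pass filter, y[n] = x[n] - α·x[n-1]. The source does not say which filter is used, so the sketch below is only one conventional choice; the coefficient 0.97 is a typical default and is an assumption here.

```python
from typing import List, Sequence


def pre_emphasize(samples: Sequence[float], alpha: float = 0.97) -> List[float]:
    """Apply y[n] = x[n] - alpha * x[n-1]; the first sample is passed through."""
    if not samples:
        return []
    out = [samples[0]]
    for prev, cur in zip(samples, samples[1:]):
        out.append(cur - alpha * prev)
    return out
```

A constant (low-frequency) input is attenuated, while a rapidly alternating (high-frequency) input is amplified, which is the emphasis effect the text describes.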
  • The embodiment of the present disclosure may also use the environmental sound information to train the wake-up model with noise.
  • After the ambient sound information is extracted from the audio to be recognized, it can be sent to the server as training data, and the sound pressure level of the ambient sound information can be used as a feature parameter in further training of the wake-up model. The wake-up model is thus trained with noise, so that during recognition it can adjust the corresponding parameters, such as the corresponding confidence threshold, according to the level of the environmental sound, allowing the wake-up model to be applied to different usage scenarios.
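As a rough illustration of using the ambient sound pressure level to adjust a confidence threshold, the sketch below computes the level with the standard definition SPL = 20·log10(p_rms / p_ref), with p_ref = 20 µPa, and then shifts a threshold. The source does not specify how the threshold depends on the noise level: the linear rule, its direction (raising the threshold in louder environments), and every parameter name and default are assumptions made purely for illustration. The samples are assumed to be calibrated pressures in pascals.

```python
import math
from typing import Sequence

P_REF = 20e-6  # standard reference pressure in pascals


def sound_pressure_level_db(pressure_samples: Sequence[float]) -> float:
    """Return the sound pressure level in dB for calibrated pressure samples."""
    if not pressure_samples:
        raise ValueError("no samples")
    rms = math.sqrt(sum(p * p for p in pressure_samples) / len(pressure_samples))
    if rms == 0.0:
        raise ValueError("silent input has no finite level")
    return 20.0 * math.log10(rms / P_REF)


def adjusted_threshold(base_threshold: float, spl_db: float,
                       quiet_db: float = 40.0, step_per_db: float = 0.002,
                       max_threshold: float = 0.95) -> float:
    """Hypothetical rule: raise the threshold linearly above quiet_db, capped."""
    extra = max(0.0, spl_db - quiet_db) * step_per_db
    return min(max_threshold, base_threshold + extra)
```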
  • S102 Process the audio to be recognized by using the wake-up model and at least two sets of training data to obtain at least two confidence levels and respective confidence thresholds.
  • The at least two sets of training data are obtained by separately training the wake-up model on at least two wake-up word training sets.
  • Each set of training data may include model parameters and a confidence threshold (the confidence threshold may also be referred to as a "wake-up threshold").
  • The processing of the audio to be recognized by using the wake-up model and at least two sets of training data, to obtain at least two confidence levels and corresponding confidence thresholds, may include:
  • The at least two wake-up word training sets are obtained by grouping according to different wake-up words, that is, each wake-up word corresponds to one wake-up word training set;
  • The at least two sets of training data are obtained by separately training the wake-up model on these wake-up word training sets, that is, there is a correspondence between training data and wake-up words, and each of the at least two sets of training data corresponds to one wake-up word. For example, assuming there are wake-up words A and B, a wake-up word A training set and a wake-up word B training set can be obtained, and after training the wake-up model, the training data of wake-up word A and the training data of wake-up word B are obtained.
  • The audio to be recognized can be recognized separately with the training data corresponding to each of the at least two wake-up words, and the confidence of the audio to be processed is obtained under the model parameters in each set of training data. It is worth noting that, unlike the related art in which different wake-up words are trained at the same time, in the embodiment of the present disclosure both the recognition process and the training process are handled separately for each wake-up word, so that there is no crosstalk between different wake-up words in the recognition result, which reduces the false wake-up rate during use; training and recognizing wake-up words in this way also reduces the workload of the processor, shortens the response time, and improves the user experience.
  • For example, the wake-up model combines the training data corresponding to at least two wake-up words (such as "Xiaomei Xiaomei" and "Xiaomei, hello") to determine the confidence level for each wake-up word, and finally obtains the respective confidence levels of the two wake-up words "Xiaomei Xiaomei" and "Xiaomei, hello".
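The idea of running one wake-up model with several independently stored parameter sets can be sketched as below. The disclosure describes a DNN; the single logistic unit used here is only a stand-in so the example stays self-contained, and the `TrainingData` fields, the feature vector, and the function names are assumptions.

```python
import math
from dataclasses import dataclass
from typing import Dict, Sequence, Tuple


@dataclass(frozen=True)
class TrainingData:
    """One set of training data: model parameters plus a confidence threshold."""
    weights: Tuple[float, ...]
    bias: float
    threshold: float


def wake_model(features: Sequence[float], params: TrainingData) -> float:
    """Shared model structure (stand-in for the DNN); returns a confidence in (0, 1)."""
    z = sum(w * x for w, x in zip(params.weights, features)) + params.bias
    return 1.0 / (1.0 + math.exp(-z))


def score_all(features: Sequence[float],
              training_sets: Dict[str, TrainingData]) -> Dict[str, Tuple[float, float]]:
    """Evaluate the same model once per wake-up word.

    Returns {wake_word: (confidence, confidence_threshold)}.
    """
    return {
        word: (wake_model(features, data), data.threshold)
        for word, data in training_sets.items()
    }
```

Because only the parameters change between evaluations, no second model has to be loaded when switching wake-up words.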
  • a speech recognition module can be built into the speech device to recognize the audio to be processed.
  • The speech recognition can also be performed by a server, through the communication connection between the speech device and the server, with the specific results then fed back to the speech device. Used as an input, this can prevent wake-up word crosstalk between multiple voice devices; the specific method can be adjusted according to the actual situation.
  • the wake-up word may be any preset word, which is not limited in this embodiment.
  • The embodiments of the present disclosure may first perform text conversion on the audio to be recognized to obtain audio text information, then perform matching on the audio text information by text matching or semantic matching to determine at least one keyword, and then process it using the wake-up model and the at least two sets of training data respectively; details will not be repeated here.
  • The wake-up model and the confidence threshold can be preset in the voice device as factory settings, so that the voice device has an initial wake-up model and confidence threshold when it is powered on for the first time; they are then updated by training during subsequent use to better suit the user's usage scenarios, and there is no limitation here.
  • The at least two sets of training data may include a first set of training data and a second set of training data; wherein the first set of training data includes first model parameters and a first confidence threshold, and the second set of training data includes second model parameters and a second confidence threshold.
  • the processing of the audio to be recognized by using the wake-up model and at least two sets of training data to obtain at least two confidence levels and respective corresponding confidence thresholds may include:
  • S103 Trigger a wakeup event of the voice device according to a comparison result between the at least two confidence levels and their corresponding confidence level thresholds.
  • The at least two confidence levels can be compared with their respective corresponding confidence thresholds; the wake-up event of the voice device is then triggered according to the comparison results.
  • the at least two confidence levels only include the first confidence level and the second confidence level.
  • the triggering the wake-up event of the voice device according to the comparison results between the at least two confidence levels and their corresponding confidence thresholds may include:
  • if the first confidence level is greater than or equal to the first confidence threshold, or the second confidence level is greater than or equal to the second confidence threshold, a wake-up event of the voice device is triggered.
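The "either confidence reaches its own threshold" rule can be written as a one-line check. This is a minimal sketch; the function name and the list-of-pairs input are assumptions, and the rule is generalized from two pairs to any number of (confidence, threshold) pairs.

```python
from typing import Iterable, Tuple


def should_wake(results: Iterable[Tuple[float, float]]) -> bool:
    """Return True if any confidence is >= its own confidence threshold."""
    return any(confidence >= threshold for confidence, threshold in results)
```

Each confidence is compared only against the threshold from the same set of training data, never against another wake-up word's threshold.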
  • The wake-up event may include a first wake-up event and/or a second wake-up event; wherein the first wake-up event is associated with the wake-up word corresponding to the first set of training data, and the second wake-up event is associated with the wake-up word corresponding to the second set of training data.
  • The triggering the wake-up event of the voice device according to the comparison results between the at least two confidence levels and their respective corresponding confidence thresholds may also include:
  • if the first confidence level is greater than or equal to the first confidence threshold and the second confidence level is less than the second confidence threshold, triggering the first wake-up event of the voice device; or,
  • if the second confidence level is greater than or equal to the second confidence threshold and the first confidence level is less than the first confidence threshold, triggering the second wake-up event of the voice device; or,
  • if the first confidence level is greater than or equal to the first confidence threshold and the second confidence level is greater than or equal to the second confidence threshold, calculating a first value by which the first confidence level exceeds the first confidence threshold and a second value by which the second confidence level exceeds the second confidence threshold, and triggering a target wake-up event of the voice device according to the first value and the second value.
  • the triggering the target wakeup event of the voice device according to the first value and the second value may include:
  • the target wake-up event is the first wake-up event and trigger
  • For example, the wake-up model has training data for the wake-up word "Xiaomei Xiaomei" and for the wake-up word "Classmate Xiaomei".
  • The confidence under each set of training data is obtained, and the two confidences are compared with their corresponding confidence thresholds.
  • the voice device can be made to execute the target wakeup event to perform a corresponding wakeup operation.
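The three-way decision between the first and second wake-up events can be sketched as follows. The text states that, when both confidences reach their thresholds, the target event is chosen according to the first value and the second value (the amounts by which each confidence exceeds its threshold), but the exact selection rule is not given in this excerpt. Picking the event with the larger excess, with ties going to the first event, is therefore an assumption, as are the function name and the string labels.

```python
from typing import Optional


def select_wake_event(conf1: float, thr1: float,
                      conf2: float, thr2: float) -> Optional[str]:
    """Return "first", "second", or None (no wake-up event)."""
    first_hit = conf1 >= thr1
    second_hit = conf2 >= thr2
    if first_hit and not second_hit:
        return "first"
    if second_hit and not first_hit:
        return "second"
    if first_hit and second_hit:
        first_value = conf1 - thr1    # excess of the first confidence
        second_value = conf2 - thr2   # excess of the second confidence
        # ASSUMPTION: larger excess wins; ties go to the first event.
        return "first" if first_value >= second_value else "second"
    return None
```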
  • different wake-up words such as wake-up words with different pronunciations and the same meaning, may generate the same wake-up command.
  • the same wake-up command wakes up the corresponding wake-up event, which can be applied to the wake-up process of different wake-up words of a single voice device, and can also be applied to a cascaded voice central control system composed of multiple voice devices.
  • Embodiments of the present disclosure can choose according to actual needs, and there is no limitation here.
  • An embodiment of the present disclosure provides a voice processing method, which is applied to a voice device.
  • The audio to be recognized is acquired; the audio to be recognized is processed using the wake-up model and at least two sets of training data respectively, to obtain at least two confidence levels and their respective corresponding confidence thresholds, wherein the at least two sets of training data are obtained by separately training the wake-up model on at least two wake-up word training sets; and a wake-up event of the voice device is triggered according to the comparison results between the at least two confidence levels and their corresponding confidence thresholds.
  • FIG. 2 shows a schematic flowchart of another wake-up processing method provided by an embodiment of the present disclosure.
  • the method may include:
  • S201 Acquire an initial training set; wherein, the initial training set includes at least two wake-up words.
  • S202 Group the initial training set according to different wake-up words to obtain the at least two groups of wake-up word training sets.
  • S203 Use the at least two sets of wake-up word training sets to train the wake-up model to obtain the at least two sets of training data.
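Steps S201 to S203 can be sketched as a group-then-train loop. This is an illustration under stated assumptions: the initial training set is modeled as (wake_word, sample) pairs, and `train_fn` is a hypothetical callable standing in for one training run of the shared wake-up model that returns that wake-up word's training data.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple, TypeVar

Sample = TypeVar("Sample")
Data = TypeVar("Data")


def group_by_wake_word(
        initial_training_set: Iterable[Tuple[str, Sample]]) -> Dict[str, List[Sample]]:
    """S202: split the initial training set into one group per wake-up word."""
    groups: Dict[str, List[Sample]] = defaultdict(list)
    for wake_word, sample in initial_training_set:
        groups[wake_word].append(sample)
    return dict(groups)


def train_per_wake_word(groups: Dict[str, List[Sample]],
                        train_fn: Callable[[List[Sample]], Data]) -> Dict[str, Data]:
    """S203: train the same model once per group and keep the results separate."""
    return {word: train_fn(samples) for word, samples in groups.items()}
```

Each call to `train_fn` sees only one wake-up word's samples, which is what keeps the resulting sets of training data from interfering with each other.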
  • the wake-up model may be a neural network model.
  • A neural network (Neural Networks, NN) is a complex network system formed by a large number of simple processing units (called "neurons") widely connected to each other. It reflects many basic characteristics of human brain function and is a highly complex nonlinear dynamic learning system.
  • Neural networks have large-scale parallelism, distributed storage and processing, self-organization, self-adaptation, and self-learning capabilities, and are especially suitable for imprecise and fuzzy information processing problems that need to consider many factors and conditions at the same time.
  • the wake-up model can be a deep neural network (Deep Neural Networks, DNN) model.
  • the wake-up model here can include the structural design of the DNN and the mathematical model of each neuron.
  • each set of training data may at least include model parameters and a confidence threshold.
  • the training data here may include optimal parameters obtained after training in the DNN (referred to as "model parameters"), confidence thresholds, and the like.
  • multiple wake-up words can be used to separately train the wake-up model to obtain corresponding training data, so as to realize the separation of wake-up word data without interfering with each other.
  • In a multi-model scheme, where multiple models are trained separately, the time required to load each wake-up model during later use means that switching between models causes a serious delay in recognition. The technical solutions provided by the embodiments of the present disclosure differ from the multi-model scheme: different wake-up words are trained through the same wake-up model, which also reduces the delay in the wake-up process.
  • The training set is divided into groups by wake-up word, the training sets corresponding to different wake-up words are trained separately, and the obtained data is stored independently.
  • The model can be trained with a limited training set while achieving the technical effect of no crosstalk between wake-up words; unlike the case where wake-up words are trained on the wake-up model at the same time, this prevents mutual crosstalk between wake-up words.
  • The method may further include: training the wake-up model according to the wake-up word training set corresponding to a new wake-up word to obtain a new set of training data.
  • The existing model can also be trained with the new wake-up word according to the training method in the above-mentioned embodiment.
  • In the above solution, the wake-up model is continuously trained and continuously learns new wake-up words, so that the product can be continuously updated to meet the new needs of users.
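Because each wake-up word's training data is stored separately, adding a new wake-up word only requires one more training run and one more stored entry. The sketch below illustrates that property; the dictionary store, the function name, and the hypothetical `train_fn` callable (one training run of the shared wake-up model) are assumptions.

```python
from typing import Callable, Dict, List, TypeVar

Sample = TypeVar("Sample")
Data = TypeVar("Data")


def add_wake_word(store: Dict[str, Data], wake_word: str,
                  samples: List[Sample],
                  train_fn: Callable[[List[Sample]], Data]) -> Dict[str, Data]:
    """Return a new store with training data for wake_word added.

    Existing entries are carried over untouched; only the new wake-up word
    is trained.
    """
    if wake_word in store:
        raise ValueError(f"training data already exists for {wake_word!r}")
    updated = dict(store)
    updated[wake_word] = train_fn(samples)
    return updated
```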
  • FIG. 3 shows a schematic diagram of a training process of a wake-up model provided by an embodiment of the present disclosure.
  • There are two wake-up word training sets, such as a wake-up word A training set and a wake-up word B training set; training the wake-up model with the wake-up word A training set yields the training data of wake-up word A, and training the wake-up model with the wake-up word B training set yields the training data of wake-up word B.
  • two training sets of wake-up words are used to train the same wake-up model, and two sets of training data can be obtained.
  • the input initial training set is divided into different groups with different wake-up words, and each group of wake-up word training sets is individually used as input data to train the wake-up model until the training of all wake-up word training sets ends.
  • the training data obtained after training the wake-up model for each set of wake-up word training sets are partitioned and stored according to different wake-up words.
  • the division of the wake-up word training set is grouped by different wake-up words, where different wake-up words can be wake-up words with completely different semantics, or wake-up words with the same semantics but in different dialect contexts, for example Cantonese (xiu mei xiu mei) and Mandarin (xiao mei xiao mei) for the wake word Xiaomei.
  • the input information is the audio to be processed; a speech recognition module can be built into the wake-up model to recognize the audio to be recognized and output the corresponding wake-up event, or a speech recognition module can be set in the voice device to obtain the audio to be recognized from the sound information and perform speech recognition on it, so as to output the corresponding wake-up event.
  • the specific manner of inputting information can be selected according to actual needs, and is not limited in any way.
  • An embodiment of the present disclosure provides a voice processing method, which is applied to a voice device.
  • each set of training data includes model parameters and confidence threshold.
  • FIG. 4 shows a detailed flowchart of a wake-up processing method provided by an embodiment of the present disclosure. Taking the existence of wake-up word A and wake-up word B as an example, as shown in Figure 4, the method may include:
  • S401: The microphone collects audio in real time, and the audio to be recognized is obtained after front-end preprocessing.
  • S402: Process the audio to be recognized using the wake-up model and the training data of wake-up word A to obtain confidence level A and the corresponding confidence threshold A.
  • S403: Process the audio to be recognized using the wake-up model and the training data of wake-up word B to obtain confidence level B and the corresponding confidence threshold B.
  • S404: Determine whether confidence level A ≥ confidence threshold A, or confidence level B ≥ confidence threshold B.
  • S405: If the judgment result is yes, trigger the wake-up event of the voice device.
  • for step S404, if the judgment result is yes, S405 can be executed, and after the voice device is woken up the method can return to step S401 to continue with the next audio collection; if the judgment result is no, the method can return directly to step S401 to continue with the next audio collection.
  • the embodiment of the present disclosure adopts a single wake-up model, and different wake-up words are trained separately to obtain independent training data, so that the training data of the wake-up words are separated without interfering with each other.
  • the related process is as follows:
  • the wake-up model, the training data of wake-up word A, and the training data of wake-up word B are stored in the voice module and used for wake-up recognition;
  • the microphone collects audio in real time, and the audio to be recognized is obtained after front-end processing;
  • the audio to be recognized is processed using the wake-up model together with the training data of wake-up word A to obtain confidence level A and confidence threshold A;
  • the audio to be recognized is processed using the wake-up model together with the training data of wake-up word B to obtain confidence level B and confidence threshold B;
  • each group of wake-up word training sets is used for separate training, so there is no possibility of crosstalk with other wake-up words, and a better recognition effect can be achieved with less training.
  • the scheme of the embodiment of the present disclosure can achieve a low false wake-up rate when multiple wake-up words are recognized simultaneously; for the related-art scheme, which trains the wake-up words at the same time, the enterprise-standard requirement for the false wake-up test is fewer than 3 times in 24 hours, whereas with the wake-up words trained separately according to the embodiment of the present disclosure, the false wake-up test can achieve fewer than 1 time in 72 hours.
  • when new wake-up words are added in the solution of the embodiment of the present disclosure, the data separation means that retraining is needed only for the newly added wake-up words and does not affect existing wake-up words, which also improves development efficiency.
  • the embodiment of the present disclosure provides a wake-up processing method.
  • the specific implementation of the aforementioned embodiment is described in detail through the above-mentioned embodiment.
  • it can be seen that the technical solutions of the foregoing embodiments not only avoid the possibility of wake-up word crosstalk when different wake-up words are trained at the same time and keep the training data of the wake-up words separate and free of mutual interference, but also reduce the false wake-up rate of the voice device when multiple wake-up words are recognized simultaneously.
  • FIG. 5 shows a schematic structural diagram of a wake-up processing device provided by an embodiment of the present disclosure.
  • the wake-up processing device 50 may include: an acquisition unit 501, a processing unit 502, and a trigger unit 503; wherein,
  • the acquiring unit 501 is configured to acquire the audio to be identified
  • the processing unit 502 is configured to process the audio to be recognized using the wake-up model and at least two sets of training data, respectively, to obtain at least two confidence levels and their respective confidence thresholds; wherein the at least two sets of training data are obtained by separately training the wake-up model with at least two wake-up word training sets;
  • the triggering unit 503 is configured to trigger a wake-up event of the voice device according to the results of comparing the at least two confidence levels with their respective confidence thresholds.
  • the acquiring unit 501 is specifically configured to collect data through a sound collection apparatus to obtain initial voice data, and to preprocess the initial voice data to obtain the audio to be recognized.
  • each set of training data includes model parameters and a confidence threshold; correspondingly, the processing unit 502 is specifically configured to process the audio to be recognized using the wake-up model and the model parameters in each of the at least two sets of training data to obtain at least two confidence levels, and to obtain from the at least two sets of training data the confidence threshold corresponding to each of the at least two confidence levels.
  • the at least two sets of training data include a first set of training data and a second set of training data, the first set of training data includes first model parameters and a first confidence threshold, and the second set of training data includes second model parameters and a second confidence threshold; correspondingly, the processing unit 502 is specifically configured to process the audio to be recognized using the wake-up model and the first model parameters in the first set of training data to obtain a first confidence level, and determine from the first set of training data the first confidence threshold corresponding to the first confidence level; and to process the audio to be recognized using the wake-up model and the second model parameters in the second set of training data to obtain a second confidence level, and determine from the second set of training data the second confidence threshold corresponding to the second confidence level.
  • the triggering unit 503 is specifically configured to trigger the wake-up event of the voice device if the first confidence level is greater than or equal to the first confidence threshold, or the second confidence level is greater than or equal to the second confidence threshold.
  • the wake-up event includes a first wake-up event and/or a second wake-up event; wherein the first wake-up event is associated with the wake-up word corresponding to the first set of training data, and the second wake-up event is associated with the wake-up word corresponding to the second set of training data.
  • the triggering unit 503 is specifically configured to: if the first confidence level is greater than or equal to the first confidence threshold and the second confidence level is less than the second confidence threshold, trigger the first wake-up event of the voice device; or, if the second confidence level is greater than or equal to the second confidence threshold and the first confidence level is less than the first confidence threshold, trigger the second wake-up event of the voice device; or, if the first confidence level is greater than or equal to the first confidence threshold and the second confidence level is greater than or equal to the second confidence threshold, calculate a first value by which the first confidence level exceeds the first confidence threshold and a second value by which the second confidence level exceeds the second confidence threshold, and trigger a target wake-up event of the voice device according to the first value and the second value.
  • the triggering unit 503 is further configured to: if the first value is greater than or equal to the second value, determine that the target wake-up event is the first wake-up event and trigger it; or, if the first value is less than the second value, determine that the target wake-up event is the second wake-up event and trigger it.
  • the acquiring unit 501 is further configured to acquire the at least two wake-up word training sets;
  • the processing unit 502 is further configured to train the wake-up model using the at least two wake-up word training sets to obtain the at least two sets of training data; wherein each set of training data includes model parameters and a confidence threshold.
  • the acquiring unit 501 is further configured to acquire an initial training set, wherein the initial training set includes at least two wake-up words, and to group the initial training set by wake-up word to obtain the at least two wake-up word training sets.
  • a "unit” may be a part of a circuit, a part of a processor, a part of a program or software, etc., of course it may also be a module, or it may be non-modular.
  • each component in this embodiment may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented in the form of hardware or in the form of software function modules.
  • if the integrated unit is implemented in the form of a software functional module and is not sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • based on this understanding, the technical solution of this embodiment, in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the method described in this embodiment.
  • the aforementioned storage medium includes: U disk, mobile hard disk, read only memory (Read Only Memory, ROM), random access memory (Random Access Memory, RAM), magnetic disk or optical disk and other various media that can store program codes.
  • this embodiment provides a computer storage medium, the computer storage medium stores a wake-up processing program, and when the wake-up processing program is executed by at least one processor, the steps of the method described in any one of the preceding embodiments are implemented.
  • FIG. 6 shows a schematic diagram of a specific hardware structure of the wake-up processing device 50 provided by an embodiment of the present disclosure.
  • it may include: a communication interface 601 , a memory 602 and a processor 603 ; each component is coupled together through a bus system 604 .
  • the bus system 604 is used to realize connection and communication between these components.
  • in addition to a data bus, the bus system 604 also includes a power bus, a control bus, and a status signal bus.
  • for clarity of illustration, however, the various buses are all labeled as bus system 604 in FIG. 6.
  • the communication interface 601 is used for receiving and sending signals during the process of sending and receiving information with other external network elements;
  • memory 602 used to store computer programs that can run on the processor 603;
  • the processor 603 is configured to, when running the computer program, execute:
  • a wakeup event of the voice device is triggered.
  • the memory 602 in the embodiment of the present disclosure may be a volatile memory or a nonvolatile memory, or may include both volatile and nonvolatile memories.
  • the non-volatile memory can be read-only memory (Read-Only Memory, ROM), programmable read-only memory (Programmable ROM, PROM), erasable programmable read-only memory (Erasable PROM, EPROM), electrically erasable programmable read-only memory (Electrically EPROM, EEPROM), or flash memory.
  • the volatile memory can be Random Access Memory (RAM), which acts as external cache memory.
  • by way of example and not limitation, many forms of RAM are available, such as static random access memory (Static RAM, SRAM), dynamic random access memory (Dynamic RAM, DRAM), synchronous dynamic random access memory (Synchronous DRAM, SDRAM), double data rate synchronous dynamic random access memory (Double Data Rate SDRAM, DDRSDRAM), enhanced synchronous dynamic random access memory (Enhanced SDRAM, ESDRAM), synchronous link dynamic random access memory (Synchronous link DRAM, SLDRAM), and direct Rambus random access memory (Direct Rambus RAM, DRRAM).
  • Memory 602 of the systems and methods described herein is intended to include, but is not limited to, these and any other suitable types of memory.
  • the processor 603 may be an integrated circuit chip with signal processing capability. In the implementation process, each step of the above method can be completed by an integrated logic circuit of hardware in the processor 603 or an instruction in the form of software.
  • the above-mentioned processor 603 may be a general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field programmable gate array (Field Programmable Gate Array, FPGA) or other Programmable logic devices, discrete gate or transistor logic devices, discrete hardware components.
  • a general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
  • the steps of the methods disclosed in the embodiments of the present disclosure may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor.
  • the software module can be located in a mature storage medium in the field such as random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, register.
  • the storage medium is located in the memory 602, and the processor 603 reads the information in the memory 602, and completes the steps of the above method in combination with its hardware.
  • the processing unit can be implemented in one or more application specific integrated circuits (Application Specific Integrated Circuits, ASIC), digital signal processors (Digital Signal Processing, DSP), digital signal processing devices (DSP Device, DSPD), programmable logic devices (Programmable Logic Device, PLD), field-programmable gate arrays (Field-Programmable Gate Array, FPGA), general-purpose processors, controllers, microcontrollers, microprocessors, other electronic units for performing the functions described in this disclosure, or combinations thereof.
  • the techniques described herein can be implemented through modules (eg, procedures, functions, and so on) that perform the functions described herein.
  • Software codes can be stored in memory and executed by a processor.
  • Memory can be implemented within the processor or external to the processor.
  • the processor 603 is further configured to execute the steps of the method described in any one of the foregoing embodiments when running the computer program.

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Automation & Control Theory (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Electrically Operated Instructional Devices (AREA)
  • Telephonic Communication Services (AREA)
  • Toys (AREA)

Abstract

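A wake-up processing method, apparatus, device, and computer storage medium, applied to a voice device. The method includes: acquiring audio to be recognized (S101); processing the audio to be recognized using a wake-up model and at least two sets of training data, respectively, to obtain at least two confidence levels and their respective confidence thresholds (S102), wherein the at least two sets of training data are obtained by separately training the wake-up model with at least two wake-up word training sets; and triggering a wake-up event of the voice device according to the results of comparing the at least two confidence levels with their respective confidence thresholds (S103).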

Description

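Wake-up processing method, apparatus, device, and computer storage medium
Cross-Reference to Related Applications
The present disclosure claims priority to Chinese patent application No. 202110904169.X, filed on August 6, 2021 and entitled "Wake-up processing method, apparatus, device, and computer storage medium", the entire contents of which are incorporated into the present disclosure by reference.
Technical Field
The present disclosure relates to the technical field of speech recognition, and in particular to a wake-up processing method, apparatus, device, and computer storage medium.
Background
With the development of speech recognition technology, smart homes have become a trend, and voice devices are gradually becoming part of people's daily lives. Many households now have voice devices of several kinds, and a voice device must be woken up before it can be controlled by voice.
In the related art, however, these voice devices commonly need to recognize multiple wake-up words. When the different wake-up words are trained at the same time, crosstalk readily occurs between them, which causes false wake-ups and raises the false wake-up rate of speech recognition.
Summary
The present disclosure aims to provide a wake-up processing method, apparatus, device, and computer storage medium that can avoid the wake-up word crosstalk that may occur when different wake-up words are trained at the same time, and reduce the false wake-up rate of a voice device.
To achieve the above objective, the technical solutions of the present disclosure are implemented as follows: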
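In a first aspect, an embodiment of the present disclosure provides a wake-up processing method applied to a voice device, the method including:
acquiring audio to be recognized;
processing the audio to be recognized using a wake-up model and at least two sets of training data, respectively, to obtain at least two confidence levels and their respective confidence thresholds; wherein the at least two sets of training data are obtained by separately training the wake-up model with at least two wake-up word training sets; and
triggering a wake-up event of the voice device according to the results of comparing the at least two confidence levels with their respective confidence thresholds.
In some embodiments, acquiring the audio to be recognized includes:
collecting data through a sound collection apparatus to obtain initial voice data; and
preprocessing the initial voice data to obtain the audio to be recognized.
In some embodiments, each set of training data includes model parameters and a confidence threshold; processing the audio to be recognized using the wake-up model and the at least two sets of training data, respectively, to obtain at least two confidence levels and their respective confidence thresholds includes:
processing the audio to be recognized using the wake-up model and the model parameters in each of the at least two sets of training data to obtain at least two confidence levels, and obtaining from the at least two sets of training data the confidence threshold corresponding to each of the at least two confidence levels.
In some embodiments, the at least two sets of training data include a first set of training data and a second set of training data, the first set of training data includes first model parameters and a first confidence threshold, and the second set of training data includes second model parameters and a second confidence threshold;
processing the audio to be recognized using the wake-up model and the at least two sets of training data, respectively, to obtain at least two confidence levels and their respective confidence thresholds includes:
processing the audio to be recognized using the wake-up model and the first model parameters in the first set of training data to obtain a first confidence level, and determining from the first set of training data the first confidence threshold corresponding to the first confidence level; and
processing the audio to be recognized using the wake-up model and the second model parameters in the second set of training data to obtain a second confidence level, and determining from the second set of training data the second confidence threshold corresponding to the second confidence level.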
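In some embodiments, triggering the wake-up event of the voice device according to the results of comparing the at least two confidence levels with their respective confidence thresholds includes:
triggering the wake-up event of the voice device if the first confidence level is greater than or equal to the first confidence threshold, or the second confidence level is greater than or equal to the second confidence threshold.
In some embodiments, the wake-up event includes a first wake-up event and/or a second wake-up event; wherein the first wake-up event is associated with the wake-up word corresponding to the first set of training data, and the second wake-up event is associated with the wake-up word corresponding to the second set of training data.
In some embodiments, triggering the wake-up event of the voice device according to the results of comparing the at least two confidence levels with their respective confidence thresholds includes:
if the first confidence level is greater than or equal to the first confidence threshold and the second confidence level is less than the second confidence threshold, triggering the first wake-up event of the voice device; or,
if the second confidence level is greater than or equal to the second confidence threshold and the first confidence level is less than the first confidence threshold, triggering the second wake-up event of the voice device; or,
if the first confidence level is greater than or equal to the first confidence threshold and the second confidence level is greater than or equal to the second confidence threshold, calculating a first value by which the first confidence level exceeds the first confidence threshold and a second value by which the second confidence level exceeds the second confidence threshold, and triggering a target wake-up event of the voice device according to the first value and the second value.
In some embodiments, triggering the target wake-up event of the voice device according to the first value and the second value includes:
if the first value is greater than or equal to the second value, determining that the target wake-up event is the first wake-up event and triggering it; or,
if the first value is less than the second value, determining that the target wake-up event is the second wake-up event and triggering it.
In some embodiments, the method further includes:
acquiring the at least two wake-up word training sets; and
training the wake-up model using the at least two wake-up word training sets to obtain the at least two sets of training data; wherein each set of training data includes model parameters and a confidence threshold.
In some embodiments, acquiring the at least two wake-up word training sets includes:
acquiring an initial training set, wherein the initial training set includes at least two wake-up words; and
grouping the initial training set by wake-up word to obtain the at least two wake-up word training sets.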
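In a second aspect, an embodiment of the present disclosure provides a wake-up processing apparatus applied to a voice device, the wake-up processing apparatus including an acquiring unit, a processing unit, and a triggering unit; wherein,
the acquiring unit is configured to acquire audio to be recognized;
the processing unit is configured to process the audio to be recognized using a wake-up model and at least two sets of training data, respectively, to obtain at least two confidence levels and their respective confidence thresholds; wherein the at least two sets of training data are obtained by separately training the wake-up model with at least two wake-up word training sets;
the triggering unit is configured to trigger a wake-up event of the voice device according to the results of comparing the at least two confidence levels with their respective confidence thresholds.
In a third aspect, an embodiment of the present disclosure provides a voice device, the voice device including a memory and a processor; wherein,
the memory is used to store a computer program capable of running on the processor;
the processor is used to execute, when running the computer program, the method according to any one of the first aspect.
In a fourth aspect, an embodiment of the present disclosure provides a computer storage medium storing a computer program which, when executed by at least one processor, implements the method according to any one of the first aspect.
Embodiments of the present disclosure provide a wake-up processing method, apparatus, device, and computer storage medium: audio to be recognized is acquired; the audio to be recognized is processed using a wake-up model and at least two sets of training data, respectively, to obtain at least two confidence levels and their respective confidence thresholds, wherein the at least two sets of training data are obtained by separately training the wake-up model with at least two wake-up word training sets; and the device to be woken up is determined according to the results of comparing the at least two confidence levels with their respective confidence thresholds. In this way, training multiple wake-up words separately on the same wake-up model avoids the possibility of crosstalk with other wake-up words, and a better recognition effect can be achieved with a smaller amount of training. In addition, because the training data of the wake-up words are kept separate and do not interfere with one another, development efficiency can be improved, and the false wake-up rate of the voice device can be reduced when multiple wake-up words are recognized simultaneously.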
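Brief Description of the Drawings
FIG. 1 is a schematic flowchart of a wake-up processing method provided by an embodiment of the present disclosure;
FIG. 2 is a schematic flowchart of another wake-up processing method provided by an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a training process of a wake-up model provided by an embodiment of the present disclosure;
FIG. 4 is a detailed schematic flowchart of a wake-up processing method provided by an embodiment of the present disclosure;
FIG. 5 is a schematic structural diagram of a wake-up processing apparatus provided by an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of a specific hardware structure of a voice device provided by an embodiment of the present disclosure.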
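Detailed Description
The technical solutions in the embodiments of the present disclosure are described clearly and completely below with reference to the drawings in the embodiments of the present disclosure. It can be understood that the specific embodiments described here are only used to explain the relevant disclosure and do not limit it. It should also be noted that, for ease of description, only the parts related to the relevant disclosure are shown in the drawings.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the technical field of the present disclosure. The terms used herein are only for the purpose of describing the embodiments of the present disclosure and are not intended to limit the present disclosure.
In the following description, reference is made to "some embodiments", which describe subsets of all possible embodiments. It can be understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with one another where there is no conflict.
It should be pointed out that the terms "first\second\third" in the embodiments of the present disclosure are only used to distinguish similar objects and do not represent a particular ordering of the objects. It can be understood that, where permitted, the specific order or sequence of "first\second\third" may be interchanged, so that the embodiments of the present disclosure described here can be implemented in an order other than that illustrated or described here.
In practical applications, there are currently two speech recognition schemes for wake-up models that can be adopted: (1) training multiple wake-up words on the same model; (2) training the wake-up words separately on multiple models.
However, in scheme (1), where multiple wake-up words are trained on the same model, crosstalk readily occurs between different wake-up words because of their similarity. Given the constraints of wake-up response speed and storage space, the training set cannot be too large, so sounds that fall between two different wake-up words are easily misrecognized, which leads to false wake-ups and user complaints. In scheme (2), loading a model takes time and switching causes a serious delay, so the scheme cannot support recognizing multiple wake-up words simultaneously. In short, in the related art these voice devices commonly need to recognize multiple wake-up words; when the different wake-up words are trained at the same time, crosstalk readily occurs between them, which causes false wake-ups and raises the false wake-up rate of speech recognition.
Based on this, an embodiment of the present disclosure provides a wake-up processing method whose basic idea is: acquiring audio to be recognized; processing the audio to be recognized using a wake-up model and at least two sets of training data, respectively, to obtain at least two confidence levels and their respective confidence thresholds, wherein the at least two sets of training data are obtained by separately training the wake-up model with at least two wake-up word training sets; and determining the device to be woken up according to the results of comparing the at least two confidence levels with their respective confidence thresholds. In this way, training multiple wake-up words separately on the same wake-up model avoids the possibility of crosstalk with other wake-up words, and a better recognition effect can be achieved with a smaller amount of training. In addition, because the training data of the wake-up words are kept separate and do not interfere with one another, development efficiency can be improved, and the false wake-up rate of the voice device can be reduced when multiple wake-up words are recognized simultaneously.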
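The embodiments of the present disclosure are described in detail below with reference to the drawings.
Embodiment 1
Referring to FIG. 1, a schematic flowchart of a wake-up processing method provided by an embodiment of the present disclosure is shown. As shown in FIG. 1, the method may include:
S101: Acquire audio to be recognized.
It should be noted that the wake-up processing method of the embodiments of the present disclosure is applied to a wake-up processing apparatus, or to a voice device integrating that apparatus. Here, the voice device can interact with the user by voice and may be any device that needs to be woken up by voice, such as a voice air conditioner, voice water heater, voice rice cooker, voice microwave oven, or any other common household appliance, without limitation.
It should also be noted that, because the voice device can interact with the user by voice, data can be collected through a sound collection apparatus. Therefore, in some embodiments, for step S101, acquiring the audio to be recognized may include:
collecting data through a sound collection apparatus to obtain initial voice data; and
preprocessing the initial voice data to obtain the audio to be recognized.
In the embodiments of the present disclosure, the sound collection apparatus may be an audio collector such as a microphone. Specifically, the initial voice data from the user can be obtained through real-time data collection by the microphone, and the audio to be recognized is then obtained by preprocessing the initial voice data.
It can be understood that the initial voice data in the embodiments of the present disclosure includes the user's voice information. The case where only ambient sound is present does not involve wake-up recognition, so it is outside the scope of this embodiment and is not described further here. That is, the initial voice data may be uttered by the user, for example "Xiaomei Xiaomei" (小美小美); after the voice device acquires the sound information, it preprocesses that information.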
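Here, the preprocessing may include two aspects, an endpoint detection process and a pre-emphasis process, each of which is described in detail below.
In one possible implementation, the endpoint detection process refers to finding the start point and end point of the command audio. Several consecutive frames of sound segments can be cut from the sound information and, following the order of the segments, the leading frames are set as the audio to be recognized. The specific number of frames set as the audio to be recognized can be determined according to the length of the configured wake-up word. For example, a specific duration is preset according to the number of characters in the wake-up word, and the sound segments within that duration are determined to be the audio to be recognized; the specific duration can be adjusted according to the actual situation, and this embodiment places no limitation on it.
Alternatively, the number of frames of the audio to be processed can be determined by detecting the length of the empty data between two consecutive sound segments. For example, in actual use the user may first call out the wake-up word and then, after a pause of several seconds, call out the remaining voice command; the segment before the empty data can then be taken as the audio to be recognized.
Illustratively, taking a voice air conditioner as an example and combining the above embodiment, suppose the voice air conditioner receives a piece of audio "Xiaomei Xiaomei" through the sound collection apparatus and the preset duration of the wake-up word "Xiaomei Xiaomei" is two seconds; the endpoint detection process then needs to cut the frames corresponding to the first 2 seconds as the audio to be processed. Alternatively, suppose the voice air conditioner receives through the sound collection apparatus a piece of audio "Xiaomei Xiaomei, raise the temperature" with a blank interval between the two phrases, during which the audio information is empty data; the frames before the empty data of the blank interval can then be taken as the audio to be processed.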
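The two truncation strategies described above (a preset wake-up word duration, or the first blank interval) can be sketched as follows. This is only an illustrative sketch, not the implementation of the disclosure: the frame representation (a list of sample lists), the energy-based test for "empty data", and the names `frame_energy`, `truncate_by_duration`, `truncate_at_gap`, `silence_energy`, and `min_gap_frames` are all assumptions introduced here.

```python
from typing import List, Sequence

Frame = Sequence[float]  # assumed representation: one frame = a sequence of samples


def frame_energy(frame: Frame) -> float:
    """Mean squared amplitude of one frame (0.0 for an empty frame)."""
    return sum(s * s for s in frame) / len(frame) if frame else 0.0


def truncate_by_duration(frames: List[Frame], frame_ms: int, wake_word_ms: int) -> List[Frame]:
    """Keep the leading frames that cover the preset wake-up word duration."""
    n_frames = -(-wake_word_ms // frame_ms)  # ceiling division
    return frames[:n_frames]


def truncate_at_gap(frames: List[Frame], silence_energy: float, min_gap_frames: int) -> List[Frame]:
    """Keep the frames before the first run of at least min_gap_frames quiet frames.

    A frame counts as "empty data" when its energy is below silence_energy
    (an assumed criterion; the source does not specify one).
    """
    run = 0
    for i, frame in enumerate(frames):
        if frame_energy(frame) < silence_energy:
            run += 1
            if run >= min_gap_frames:
                return frames[: i - run + 1]
        else:
            run = 0
    return frames  # no blank interval found: keep everything
```

With 20 ms frames and the two-second preset duration from the example, `truncate_by_duration` keeps the first 100 frames.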
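In another possible implementation, the pre-emphasis process refers to emphasizing the high-frequency part of the audio to increase high-frequency resolution. After the sound information is acquired, audio recognition is used to extract the ambient sound information and the audio information from it, remove noise interference, and increase high-frequency resolution, so as to obtain clear human voice information.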
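The source does not specify how the high-frequency emphasis is computed. A common choice in speech front ends is a first-order filter y[n] = x[n] - alpha * x[n-1]; the sketch below assumes that filter and a conventional coefficient of 0.97, neither of which comes from the disclosure. It covers only the emphasis step, not the separation of ambient sound or noise removal.

```python
from typing import List, Sequence


def pre_emphasis(samples: Sequence[float], alpha: float = 0.97) -> List[float]:
    """First-order pre-emphasis: y[0] = x[0], y[n] = x[n] - alpha * x[n-1].

    alpha is an assumed, conventional value; the source gives no coefficient.
    """
    if not samples:
        return []
    out = [samples[0]]
    for prev, cur in zip(samples, samples[1:]):
        out.append(cur - alpha * prev)
    return out
```

A slowly varying signal is attenuated while a rapidly alternating one is amplified, which is the high-frequency boost the paragraph above describes.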
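It should be noted that, for audio to be recognized that carries ambient sound, the embodiments of the present disclosure can also use it for noise-aware training of the wake-up model. Specifically, after the ambient sound information is extracted from the audio to be recognized, it can be sent to a server for use as training data. In further training of the wake-up model, the sound pressure level of the ambient sound information can be used as a feature parameter and the wake-up model trained with noise, so that the recognition process of the wake-up model can adjust the corresponding parameters, for example the corresponding confidence threshold, according to the level of the ambient sound, making the wake-up model applicable to different usage scenarios.
S102: Process the audio to be recognized using the wake-up model and at least two sets of training data, respectively, to obtain at least two confidence levels and their respective confidence thresholds.
In the embodiments of the present disclosure, the at least two sets of training data are obtained by separately training the wake-up model with at least two wake-up word training sets. Each set of training data may include model parameters and a confidence threshold (the confidence threshold may also be called the "wake-up threshold").
Correspondingly, in some embodiments, for S102, processing the audio to be recognized using the wake-up model and the at least two sets of training data, respectively, to obtain at least two confidence levels and their respective confidence thresholds may include:
processing the audio to be recognized using the wake-up model and the model parameters in each of the at least two sets of training data to obtain at least two confidence levels, and obtaining from the at least two sets of training data the confidence threshold corresponding to each of the at least two confidence levels.
It should be noted that the at least two wake-up word training sets are obtained by grouping according to different wake-up words, that is, each wake-up word corresponds to one wake-up word training set. In addition, the at least two sets of training data are obtained by separately training the wake-up model with the at least two wake-up word training sets, that is, there is a correspondence between training data and wake-up words, and each of the at least two sets of training data corresponds to one wake-up word. For example, if there are a wake-up word A and a wake-up word B, a wake-up word A training set and a wake-up word B training set can be obtained, and training the wake-up model yields the training data of wake-up word A and the training data of wake-up word B.
In this way, when the audio to be recognized is processed, it can be recognized separately using the training data corresponding to each of the at least two wake-up words, yielding the confidence level of the audio to be processed under the model parameters in each set of training data. It is worth noting that, unlike the related art in which different wake-up words are trained at the same time, in the embodiments of the present disclosure both the recognition process and the training process are separated by wake-up word, so that no crosstalk occurs between different wake-up words in the recognition results, which reduces the false wake-up rate in use. Moreover, training and recognizing per wake-up word greatly reduces the load on the processor and shortens the response time, improving the user experience.
Illustratively, taking a voice air conditioner as an example, suppose the voice air conditioner receives the audio to be recognized "Xiaomei Xiaomei". The wake-up model then uses the training data corresponding to at least two wake-up words (for example "Xiaomei Xiaomei" (小美小美) and "Xiaomei Nihao" (小美你好)) to determine the confidence level for each wake-up word, finally obtaining the confidence levels corresponding to the two wake-up words "Xiaomei Xiaomei" and "Xiaomei Nihao".
It can be understood that a speech recognition module may be built into the voice device to recognize the audio to be processed. Of course, the voice device may also be communicatively connected to a server, with the server performing speech recognition and feeding the specific result back to the voice device as input, which can prevent wake-up word crosstalk among multiple voice devices; the specific approach can be adjusted according to the actual situation. It can also be understood that the wake-up word may be any preset text, and this embodiment places no limitation on it.
In addition, before processing the audio to be recognized using the wake-up model and the at least two sets of training data, respectively, the embodiments of the present disclosure may first perform text conversion on the audio to be recognized to obtain audio text information, then match the audio text information by text matching or semantic matching to determine at least one keyword or key character, and then process it using the wake-up model and the at least two sets of training data, respectively; details are not repeated here.
Furthermore, in the embodiments of the present disclosure, the wake-up model and the confidence threshold may be preset in the voice device as factory settings, so that the voice device has an initial wake-up model and confidence threshold when first powered on, and these are updated by training during subsequent use to better fit the user's usage scenario; no limitation is placed on this here either.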
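Further, assuming that the at least two sets of training data here are two sets, the at least two sets of training data may include a first set of training data and a second set of training data; wherein the first set of training data includes first model parameters and a first confidence threshold, and the second set of training data includes second model parameters and a second confidence threshold.
Correspondingly, in some embodiments, processing the audio to be recognized using the wake-up model and the at least two sets of training data, respectively, to obtain at least two confidence levels and their respective confidence thresholds may include:
processing the audio to be recognized using the wake-up model and the first model parameters in the first set of training data to obtain a first confidence level, and determining from the first set of training data the first confidence threshold corresponding to the first confidence level; and
processing the audio to be recognized using the wake-up model and the second model parameters in the second set of training data to obtain a second confidence level, and determining from the second set of training data the second confidence threshold corresponding to the second confidence level.
That is, assuming there are a wake-up word A and a wake-up word B, after the training data of wake-up word A and the training data of wake-up word B are obtained, the audio to be recognized can be processed using the wake-up model and the training data of wake-up word A to obtain the confidence level of wake-up word A and the corresponding confidence threshold, and the audio to be recognized can be processed using the wake-up model and the training data of wake-up word B to obtain the confidence level of wake-up word B and the corresponding confidence threshold. The confidence level and confidence threshold corresponding to each of the two sets of training data are thus obtained, so that the wake-up event to be triggered can subsequently be determined by comparison.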
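The per-word processing in S102 can be sketched as one shared model evaluated once per set of training data. This is a minimal sketch under assumptions: the `TrainingData` structure, the `score_all` function, and in particular `toy_model` (a clipped dot product standing in for the wake-up model) are hypothetical and are not taken from the disclosure.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence, Tuple


@dataclass(frozen=True)
class TrainingData:
    """One set of training data: parameters plus threshold for a single wake-up word."""
    wake_word: str
    model_params: Sequence[float]
    confidence_threshold: float


# The shared wake-up model: (audio features, model parameters) -> confidence level.
WakeModel = Callable[[Sequence[float], Sequence[float]], float]


def score_all(model: WakeModel,
              audio_features: Sequence[float],
              training_sets: Sequence[TrainingData]) -> Dict[str, Tuple[float, float]]:
    """Run the same model once per wake-up word; return (confidence, threshold) per word."""
    results = {}
    for data in training_sets:
        confidence = model(audio_features, data.model_params)
        results[data.wake_word] = (confidence, data.confidence_threshold)
    return results


def toy_model(features: Sequence[float], params: Sequence[float]) -> float:
    """Illustrative stand-in only: dot product clipped to the range [0, 1]."""
    raw = sum(f * p for f, p in zip(features, params))
    return max(0.0, min(1.0, raw))
```

Only the parameters change between evaluations, so no model has to be loaded or switched, which is the point the text makes against the multi-model scheme.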
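S103: Trigger a wake-up event of the voice device according to the results of comparing the at least two confidence levels with their respective confidence thresholds.
It should be noted that, after the at least two confidence levels and their respective confidence thresholds are obtained, each of the at least two confidence levels can be compared with its corresponding confidence threshold, and the wake-up event of the voice device is then triggered according to the comparison results.
Specifically, taking the case of two wake-up words as an example, the at least two confidence levels include only a first confidence level and a second confidence level. In some embodiments, for S103, triggering the wake-up event of the voice device according to the results of comparing the at least two confidence levels with their respective confidence thresholds may include:
triggering the wake-up event of the voice device if the first confidence level is greater than or equal to the first confidence threshold, or the second confidence level is greater than or equal to the second confidence threshold.
In the embodiments of the present disclosure, the wake-up event may include a first wake-up event and/or a second wake-up event; wherein the first wake-up event is associated with the wake-up word corresponding to the first set of training data, and the second wake-up event is associated with the wake-up word corresponding to the second set of training data.
In some embodiments, when the at least two confidence levels include a first confidence level and a second confidence level, triggering the wake-up event of the voice device according to the results of comparing the at least two confidence levels with their respective confidence thresholds may also include:
if the first confidence level is greater than or equal to the first confidence threshold and the second confidence level is less than the second confidence threshold, triggering the first wake-up event of the voice device; or,
if the second confidence level is greater than or equal to the second confidence threshold and the first confidence level is less than the first confidence threshold, triggering the second wake-up event of the voice device; or,
if the first confidence level is greater than or equal to the first confidence threshold and the second confidence level is greater than or equal to the second confidence threshold, calculating a first value by which the first confidence level exceeds the first confidence threshold and a second value by which the second confidence level exceeds the second confidence threshold, and triggering a target wake-up event of the voice device according to the first value and the second value.
It should be noted that, when both confidence levels are greater than or equal to their corresponding confidence thresholds, it is necessary to calculate the first value by which the first confidence level exceeds the first confidence threshold and the second value by which the second confidence level exceeds the second confidence threshold. In some embodiments, triggering the target wake-up event of the voice device according to the first value and the second value may include:
if the first value is greater than or equal to the second value, determining that the target wake-up event is the first wake-up event and triggering it; or,
if the first value is less than the second value, determining that the target wake-up event is the second wake-up event and triggering it.
Illustratively, if the sound collection apparatus of the device to be woken up receives audio to be processed whose content is "Xiaomei Tongxue" (小美同学), the wake-up model, together with the training data corresponding to the wake-up word "Xiaomei Xiaomei" and the training data corresponding to the wake-up word "Xiaomei Tongxue", obtains the confidence level for "Xiaomei Xiaomei" and the confidence level for "Xiaomei Tongxue", and each of the two confidence levels is compared with its corresponding confidence threshold. If the confidence level of "Xiaomei Xiaomei" is greater than or equal to its corresponding confidence threshold, the wake-up event corresponding to "Xiaomei Xiaomei" is the target wake-up event; otherwise, if the confidence level of "Xiaomei Tongxue" is greater than or equal to its corresponding confidence threshold, the wake-up event corresponding to "Xiaomei Tongxue" is the target wake-up event.
It should be noted that, in the special case where the confidence levels of both "Xiaomei Xiaomei" and "Xiaomei Tongxue" are greater than or equal to their corresponding confidence thresholds, the amounts by which the confidence levels exceed the thresholds can be compared, and the wake-up event of the wake-up word whose confidence level exceeds its threshold by the larger amount is the target wake-up event. After the target wake-up event is determined, the voice device can execute it to perform the corresponding wake-up operation.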
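The three-way decision for two wake-up words can be sketched as below. The function name `select_wake_event` and the return labels are hypothetical; the sketch takes "the value by which the confidence level exceeds the threshold" to be the plain difference, which is one reading of the text, and resolves a tie in favor of the first wake-up event as the "greater than or equal to" condition above states.

```python
from typing import Optional


def select_wake_event(conf1: float, thr1: float,
                      conf2: float, thr2: float) -> Optional[str]:
    """Return "first", "second", or None (no wake-up) for two wake-up words."""
    hit1 = conf1 >= thr1
    hit2 = conf2 >= thr2
    if hit1 and not hit2:
        return "first"
    if hit2 and not hit1:
        return "second"
    if hit1 and hit2:
        first_value = conf1 - thr1    # amount by which confidence 1 exceeds its threshold
        second_value = conf2 - thr2   # amount by which confidence 2 exceeds its threshold
        return "first" if first_value >= second_value else "second"
    return None  # neither threshold reached
```

For example, with confidence levels 0.75 and 1.0 against thresholds of 0.5 each, both thresholds are reached and the second event is chosen because its excess (0.5) is larger than the first (0.25).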
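It should also be noted that different wake-up words, for example wake-up words with different pronunciations but the same meaning, can generate the same wake-up instruction. The same wake-up instruction wakes the corresponding wake-up event, and this can be applied both to the wake-up process of a single voice device with different wake-up words and to a cascaded voice central control system composed of multiple voice devices; the embodiments of the present disclosure can choose according to need, and no limitation is placed on this here.
An embodiment of the present disclosure provides a voice processing method applied to a voice device. Audio to be recognized is acquired; the audio to be recognized is processed using a wake-up model and at least two sets of training data, respectively, to obtain at least two confidence levels and their respective confidence thresholds, wherein the at least two sets of training data are obtained by separately training the wake-up model with at least two wake-up word training sets; and a wake-up event of the voice device is triggered according to the results of comparing the at least two confidence levels with their respective confidence thresholds. In this way, training multiple wake-up words separately on the same wake-up model avoids the possibility of crosstalk with other wake-up words, and a better recognition effect can be achieved with a smaller amount of training. In addition, because the training data of the wake-up words are kept separate and do not interfere with one another, development efficiency can be improved, and the false wake-up rate of the voice device can be reduced when multiple wake-up words are recognized simultaneously.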
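Embodiment 2
Based on the same inventive concept as the foregoing embodiment, referring to FIG. 2, a schematic flowchart of another wake-up processing method provided by an embodiment of the present disclosure is shown. As shown in FIG. 2, the method may include:
S201: Acquire an initial training set, wherein the initial training set includes at least two wake-up words.
S202: Group the initial training set by wake-up word to obtain the at least two wake-up word training sets.
S203: Train the wake-up model using the at least two wake-up word training sets to obtain the at least two sets of training data.
It should be noted that, in the embodiments of the present disclosure, the wake-up model may be a neural network model. A neural network (Neural Networks, NN) is a complex network system formed by a large number of simple processing units (called "neurons") that are widely interconnected; it reflects many basic features of human brain function and is a highly complex nonlinear dynamic learning system. Neural networks have massively parallel, distributed storage and processing, self-organizing, adaptive, and self-learning capabilities, and are particularly suited to imprecise and fuzzy information processing problems in which many factors and conditions must be considered at once. Here, the wake-up model may be a deep neural network (Deep Neural Networks, DNN) model. Specifically, the wake-up model here may include the structural design of the DNN and the mathematical model of each neuron.
It should be noted that, in the embodiments of the present disclosure, each set of training data may include at least model parameters and a confidence threshold. Specifically, the training data here may include the optimal parameters obtained after training in the DNN (referred to as "model parameters" for short), the confidence threshold, and the like.
It should also be noted that the embodiments of the present disclosure can train the wake-up model separately with multiple wake-up words to obtain the corresponding training data, so that the wake-up word data are kept separate and do not interfere with one another. In addition, because the multi-model approach trains multiple models separately, the time needed to load a wake-up model means that switching causes a serious recognition delay in later use; the technical solution provided by the embodiments of the present disclosure differs from the multi-model scheme in that different wake-up words are trained through the same wake-up model, which also reduces delay during wake-up processing.
It should also be noted that the training set is divided into groups by wake-up word, the training sets corresponding to different wake-up words are trained separately, and the resulting data are stored independently. The model can thus be trained with a limited training set while achieving the technical effect of no crosstalk between wake-up words; unlike the case where the wake-up words are trained on the wake-up model at the same time, this prevents mutual crosstalk between wake-up words.
Further, when a new wake-up word needs to be added, in some embodiments the method may also include: training the wake-up model according to the wake-up word training set corresponding to the new wake-up word to obtain a new set of training data.
That is, because the embodiments of the present disclosure achieve data separation, retraining is needed only for the newly added wake-up word and does not affect existing wake-up words; development efficiency can thus also be improved when new wake-up words are added.
In other words, for a wake-up model that is already in use, when a new wake-up word needs to be added, the existing model can also be trained with the new wake-up word according to the training method in the above embodiment, with the wake-up model already in use serving as the wake-up model in the above solution. The wake-up model is thus trained continually and keeps learning new wake-up words, so that the product can be continuously updated to meet users' new needs.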
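Adding a wake-up word without touching the existing ones can be sketched as follows, assuming the per-word training data are kept in a mapping keyed by wake-up word. The names `add_wake_word`, `store`, and `train_fn` are hypothetical; `train_fn` stands in for whatever routine trains the wake-up model on one wake-up word training set and returns that word's training data.

```python
from typing import Callable, Dict, Sequence, Tuple

# Assumed shape of one set of training data: (model parameters, confidence threshold).
TrainingData = Tuple[Sequence[float], float]


def add_wake_word(store: Dict[str, TrainingData],
                  wake_word: str,
                  samples: Sequence[Sequence[float]],
                  train_fn: Callable[[Sequence[Sequence[float]]], TrainingData],
                  ) -> Dict[str, TrainingData]:
    """Train only the new wake-up word and return an extended copy of the store.

    Existing entries are carried over unchanged, reflecting the data separation
    described in the text.
    """
    if wake_word in store:
        raise ValueError(f"wake-up word already present: {wake_word}")
    updated = dict(store)
    updated[wake_word] = train_fn(samples)
    return updated
```

Because the existing entries are reused as-is, only the cost of training the new word is paid when the product is updated.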
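Illustratively, taking the existence of wake-up word A and wake-up word B as an example, referring to FIG. 3, a schematic diagram of a training process of a wake-up model provided by an embodiment of the present disclosure is shown. In FIG. 3, there are two wake-up word training sets, namely a wake-up word A training set and a wake-up word B training set; training the wake-up model with the wake-up word A training set yields the training data of wake-up word A, and training the wake-up model with the wake-up word B training set yields the training data of wake-up word B.
Specifically, the embodiments of the present disclosure use two wake-up word training sets to train the same wake-up model and obtain two sets of training data. Here, the input initial training set is divided into different groups by wake-up word, and each wake-up word training set in turn is used on its own as input data to train the wake-up model, until training on all wake-up word training sets has finished. In addition, it should be noted that the training data obtained after each wake-up word training set trains the wake-up model are stored in separate partitions by wake-up word.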
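The grouping and one-group-at-a-time training described above can be sketched as follows. The sample format (pairs of wake-up word label and feature list) and the names `group_by_wake_word`, `train_separately`, and `train_fn` are assumptions for illustration; the actual DNN training routine is not described in enough detail in the source to reproduce.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Sequence, Tuple

Sample = Sequence[float]
# Assumed shape of one set of training data: (model parameters, confidence threshold).
TrainingData = Tuple[Sequence[float], float]


def group_by_wake_word(initial_training_set: Iterable[Tuple[str, Sample]]
                       ) -> Dict[str, List[Sample]]:
    """Split the initial training set into one training set per wake-up word."""
    groups: Dict[str, List[Sample]] = defaultdict(list)
    for wake_word, sample in initial_training_set:
        groups[wake_word].append(sample)
    return dict(groups)


def train_separately(groups: Dict[str, List[Sample]],
                     train_fn: Callable[[List[Sample]], TrainingData]
                     ) -> Dict[str, TrainingData]:
    """Train on each group in turn and store each result under its own wake-up word."""
    store: Dict[str, TrainingData] = {}
    for wake_word, samples in groups.items():
        store[wake_word] = train_fn(samples)  # sees only this word's samples
    return store
```

Keying the store by wake-up word mirrors the partitioned storage mentioned in the text: each entry is produced from, and belongs to, exactly one wake-up word.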
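Illustratively, the wake-up word training sets are divided by grouping according to different wake-up words. Here, different wake-up words may be wake-up words with completely different semantics, or wake-up words with the same semantics but in different dialect contexts, for example the Cantonese (xiu mei xiu mei) and Mandarin (xiao mei xiao mei) forms of the wake-up word "Xiaomei Xiaomei" (小美小美). In this way, when the wake-up model is used for recognition, several pieces of input information may be input to produce one piece of output information. The input information is the audio to be processed. A speech recognition module may be built into the wake-up model to recognize the audio to be recognized and output the corresponding wake-up event; alternatively, a speech recognition module may be set in the voice device to obtain the audio to be recognized from the sound information and perform speech recognition on it, so as to output the corresponding wake-up event. In the embodiments of the present disclosure, the specific manner of inputting information can be selected according to actual needs and is not limited in any way.
An embodiment of the present disclosure provides a voice processing method applied to a voice device. The at least two wake-up word training sets are acquired, and the wake-up model is trained using the at least two wake-up word training sets to obtain the at least two sets of training data, wherein each set of training data includes model parameters and a confidence threshold. In this way, the possibility of wake-up word crosstalk when different wake-up words are trained at the same time can be avoided, and the wake-up words and the training data are each kept separate, so that the false wake-up rate of the voice device can be reduced when multiple wake-up words are recognized simultaneously.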
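Embodiment 3
Based on the same inventive concept as the foregoing embodiments, referring to FIG. 4, a detailed schematic flowchart of a wake-up processing method provided by an embodiment of the present disclosure is shown. Taking the existence of wake-up word A and wake-up word B as an example, as shown in FIG. 4, the method may include:
S401: The microphone collects audio in real time, and the audio to be recognized is obtained after front-end preprocessing.
S402: Process the audio to be recognized using the wake-up model and the training data of wake-up word A to obtain confidence level A and the corresponding confidence threshold A.
S403: Process the audio to be recognized using the wake-up model and the training data of wake-up word B to obtain confidence level B and the corresponding confidence threshold B.
S404: Determine whether confidence level A ≥ confidence threshold A, or confidence level B ≥ confidence threshold B.
S405: If the judgment result is yes, trigger the wake-up event of the voice device.
It should be noted that, for step S404, if the judgment result is yes, S405 can be executed, and after the voice device is woken up the method can return to step S401 to continue with the next audio collection; if the judgment result is no, the method can return directly to step S401 to continue with the next audio collection.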
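The S401 to S405 loop can be sketched as below. The names `wake_loop`, `audio_source`, `score`, and `on_wake` are hypothetical: `audio_source` stands in for the microphone plus front-end preprocessing, and `score` stands in for running the wake-up model with the training data of each wake-up word. The sketch uses a finite iterable so that it terminates; a device would loop indefinitely.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

Audio = TypeVar("Audio")


def wake_loop(audio_source: Iterable[Audio],
              score: Callable[[Audio], List[Tuple[float, float]]],
              on_wake: Callable[[Audio], None]) -> int:
    """Process each piece of audio in turn; return how many wake-up events were triggered."""
    wake_count = 0
    for audio in audio_source:                                    # S401
        pairs = score(audio)                                      # S402, S403
        if any(conf >= threshold for conf, threshold in pairs):   # S404
            on_wake(audio)                                        # S405
            wake_count += 1
        # In both branches the loop continues with the next audio (back to S401).
    return wake_count
```

The `any(...)` test corresponds to the "or" in S404: reaching a single threshold is enough to trigger the wake-up event.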
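The embodiments of the present disclosure adopt a single wake-up model, and different wake-up words are trained separately to obtain independent training data, so that the training data of the wake-up words are kept separate and do not interfere with one another.
In one possible embodiment, the related process is as follows:
(1) multiple wake-up words are designed to use the same wake-up model;
(2) the wake-up model is trained with the wake-up word A training set to obtain the training data of wake-up word A;
(3) the wake-up model is trained with the wake-up word B training set to obtain the training data of wake-up word B;
(4) the wake-up model, the training data of wake-up word A, and the training data of wake-up word B are stored in the voice module for wake-up recognition;
(5) the microphone collects audio in real time, and the audio to be recognized is obtained after front-end processing;
(6) the audio to be recognized is processed using the wake-up model together with the training data of wake-up word A to obtain confidence level A and confidence threshold A;
(7) the audio to be recognized is processed using the wake-up model together with the training data of wake-up word B to obtain confidence level B and confidence threshold B;
(8) if confidence level A ≥ confidence threshold A, or confidence level B ≥ confidence threshold B, the wake-up event is triggered.
In another possible embodiment, processing step (8) may instead be carried out as follows:
(1) if confidence level A ≥ confidence threshold A and confidence level B < confidence threshold B, wake-up event A is triggered;
(2) if confidence level A < confidence threshold A and confidence level B ≥ confidence threshold B, wake-up event B is triggered;
(3) if confidence level A ≥ confidence threshold A and confidence level B ≥ confidence threshold B, a comprehensive judgment is made according to the percentage by which each confidence level exceeds its confidence threshold, and the wake-up event is triggered.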
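The source does not define how the percentage in item (3) is computed or how the "comprehensive judgment" is made. One plausible reading, sketched below, is the relative excess (confidence - threshold) / threshold, with the larger percentage winning and a tie going to wake-up word A; both the formula and the tie rule are assumptions, as are the names `percent_over` and `pick_by_percent`.

```python
def percent_over(confidence: float, threshold: float) -> float:
    """Percentage by which a confidence level exceeds its (positive) threshold."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    return (confidence - threshold) / threshold * 100.0


def pick_by_percent(conf_a: float, thr_a: float, conf_b: float, thr_b: float) -> str:
    """Choose between A and B when both thresholds are reached (tie goes to A)."""
    return "A" if percent_over(conf_a, thr_a) >= percent_over(conf_b, thr_b) else "B"
```

A relative measure can rank the candidates differently from an absolute difference when the two thresholds differ: with confidence 0.6 against threshold 0.4 for A and 0.95 against 0.7 for B, the absolute excess is larger for B (0.25 versus 0.2), but the percentage is larger for A (about 50% versus about 36%).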
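It should also be noted that, in the related-art scheme, multiple wake-up words are trained at the same time and there is crosstalk between different wake-up words. That is, with insufficient training, an ambiguous sound in the ambient noise that resembles two wake-up words will be misjudged, especially when the wake-up words share repeated characters (such as "Xiaomei Xiaomei" (小美小美) and "Xiaomei Tongxue" (小美同学)). Distinguishing each wake-up word then depends on model design and a large training set, and because of limits on hardware storage resources and the requirement on wake-up response speed, the crosstalk problem is difficult to solve. This embodiment trains each wake-up word training set separately, so there is no possibility of crosstalk with other wake-up words, and a better recognition effect can be achieved with a smaller amount of training.
Taking the Cantonese (xiu mei xiu mei) and Mandarin (xiao mei xiao mei) forms of the wake-up word "Xiaomei Xiaomei" as an example, a comparison of model test data when the wake-up words are trained at the same time and when they are trained separately is shown in Table 1.
Table 1
[Table 1 is provided in the original publication as image PCTCN2022082571-appb-000001; its contents are not reproduced here.]
As can be seen from the model test data in Table 1 above, the scheme of the embodiments of the present disclosure can achieve a low false wake-up rate when multiple wake-up words are recognized simultaneously. For the related-art scheme, which trains the wake-up words at the same time, the enterprise-standard requirement for the false wake-up test is fewer than 3 times in 24 hours; with the wake-up words trained separately according to the embodiments of the present disclosure, the false wake-up test can achieve fewer than 1 time in 72 hours. Moreover, when a new wake-up word is added in the scheme of the embodiments of the present disclosure, the data separation means that retraining is needed only for the newly added wake-up word and does not affect existing wake-up words, which also improves development efficiency.
An embodiment of the present disclosure provides a wake-up processing method, and the above embodiment elaborates on the specific implementation of the foregoing embodiments. It can be seen that the technical solutions of the foregoing embodiments not only avoid the possibility of wake-up word crosstalk when different wake-up words are trained at the same time and keep the training data of the wake-up words separate and free of mutual interference, but also reduce the false wake-up rate of the voice device when multiple wake-up words are recognized simultaneously.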
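Embodiment 4
Based on the same inventive concept as the foregoing embodiments, referring to FIG. 5, a schematic structural diagram of a wake-up processing apparatus provided by an embodiment of the present disclosure is shown. As shown in FIG. 5, the wake-up processing apparatus 50 may include an acquiring unit 501, a processing unit 502, and a triggering unit 503; wherein,
the acquiring unit 501 is configured to acquire audio to be recognized;
the processing unit 502 is configured to process the audio to be recognized using a wake-up model and at least two sets of training data, respectively, to obtain at least two confidence levels and their respective confidence thresholds; wherein the at least two sets of training data are obtained by separately training the wake-up model with at least two wake-up word training sets;
the triggering unit 503 is configured to trigger a wake-up event of the voice device according to the results of comparing the at least two confidence levels with their respective confidence thresholds.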
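The division into three units can be sketched as a small composition of callables. The class name `WakeUpProcessingDevice`, the method `run_once`, and the representation of each unit as a function are assumptions for illustration only; as the text notes elsewhere, a "unit" may equally be part of a circuit, a processor, or a program.

```python
from typing import Callable, List, Tuple


class WakeUpProcessingDevice:
    """Composition of an acquiring unit, a processing unit, and a triggering unit."""

    def __init__(self,
                 acquire: Callable[[], object],
                 process: Callable[[object], List[Tuple[float, float]]],
                 trigger: Callable[[List[Tuple[float, float]]], bool]) -> None:
        self.acquire = acquire    # acquiring unit 501: returns audio to be recognized
        self.process = process    # processing unit 502: returns (confidence, threshold) pairs
        self.trigger = trigger    # triggering unit 503: decides whether to wake the device

    def run_once(self) -> bool:
        """Acquire one piece of audio, score it, and report whether a wake-up was triggered."""
        audio = self.acquire()
        pairs = self.process(audio)
        return self.trigger(pairs)
```

Keeping the units as separate callables reflects the statement that the units may exist separately or be integrated, since any of the three can be replaced without changing the others.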
在一些实施例中,获取单元501,具体配置为通过声音采集装置进行数据采集,获取初始语音数据;以及对所述初始语音数据进行预处理,得到所述待识别音频。
在一些实施例中,每一组训练数据包括模型参数和置信度阈值;相应地,处理单元502,具体配置为利用所述唤醒模型和所述至少两组训练数据中的模型参数分别对所述待识别音频进行处理,得到至少两个置信度,以及从所述至少两组训练数据中获取所述至少两个置信度各自对应的置信度阈值。
在一些实施例中,所述至少两组训练数据包括第一组训练数据和第二组训练数据,所述第一组训练数据包括第一模型参数和第一置信度阈值,所述第二组训练数据包括第二模型参数和第二置信度阈值;相应地,处理单元502,具体配置为利用所述唤醒模型和所述第一组训练数据中的所述第一模型参数对所述待识别音频进行处理,得到第一置信度,并从所述第一组训练数据中确定所述第一置信度对应的所述第一置信度阈值;以及利用所述唤醒模型和所述第二组训练数据中的所述第二模型参数对所述待识别音频进行处理,得到第二置信度,并从所述第二组训练数据中确定所述第二置信度对应的所述第二置信度阈值。
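In some embodiments, the trigger unit 503 is specifically configured to trigger the wake-up event of the voice device if the first confidence is greater than or equal to the first confidence threshold, or the second confidence is greater than or equal to the second confidence threshold.
In some embodiments, the wake-up event includes a first wake-up event and/or a second wake-up event, wherein the first wake-up event is associated with the wake word corresponding to the first set of training data, and the second wake-up event is associated with the wake word corresponding to the second set of training data.
In some embodiments, the trigger unit 503 is specifically configured to: trigger the first wake-up event of the voice device if the first confidence is greater than or equal to the first confidence threshold and the second confidence is less than the second confidence threshold; or trigger the second wake-up event of the voice device if the second confidence is greater than or equal to the second confidence threshold and the first confidence is less than the first confidence threshold; or, if the first confidence is greater than or equal to the first confidence threshold and the second confidence is greater than or equal to the second confidence threshold, calculate a first value by which the first confidence exceeds the first confidence threshold and a second value by which the second confidence exceeds the second confidence threshold, and trigger a target wake-up event of the voice device according to the first value and the second value.
In some embodiments, the trigger unit 503 is further configured to: determine that the target wake-up event is the first wake-up event and trigger it if the first value is greater than or equal to the second value; or determine that the target wake-up event is the second wake-up event and trigger it if the first value is less than the second value.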
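In some embodiments, the acquisition unit 501 is further configured to acquire the at least two wake-word training sets;
the processing unit 502 is further configured to train the wake-up model using the at least two wake-word training sets to obtain the at least two sets of training data, wherein each set of training data includes model parameters and a confidence threshold.
In some embodiments, the acquisition unit 501 is further configured to acquire an initial training set that includes at least two wake words, and to group the initial training set by wake word to obtain the at least two wake-word training sets.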
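Grouping an initial training set by wake word and then training each group on its own can be sketched as below. The "training" here is a deliberately trivial placeholder (a per-group mean feature vector and a fixed threshold of 0.5), not the training procedure of the disclosure; the labels, function names, and dictionary keys are likewise assumptions.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple

Sample = Tuple[str, List[float]]  # (wake-word label, feature vector)


def group_by_wake_word(initial_set: Iterable[Sample]) -> Dict[str, List[List[float]]]:
    """Split the initial training set into one training set per wake word."""
    groups: Dict[str, List[List[float]]] = defaultdict(list)
    for label, features in initial_set:
        groups[label].append(features)
    return dict(groups)


def train_separately(groups: Dict[str, List[List[float]]]) -> Dict[str, dict]:
    """Placeholder training: each group yields its own parameters and threshold."""
    trained = {}
    for label, vectors in groups.items():
        params = [mean(column) for column in zip(*vectors)]
        trained[label] = {"params": params, "threshold": 0.5}  # threshold is hypothetical
    return trained


initial_set = [
    ("xiaomei_xiaomei", [1.0, 0.0]),
    ("xiaomei_tongxue", [0.0, 1.0]),
    ("xiaomei_xiaomei", [0.8, 0.2]),
]
groups = group_by_wake_word(initial_set)
trained = train_separately(groups)
```

Because each group is processed independently, adding a new wake word only adds one more entry to `trained`; the existing entries are not recomputed.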
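It can be understood that, in this embodiment, a "unit" may be part of a circuit, part of a processor, part of a program or software, and so on; it may also be a module, or it may be non-modular. Moreover, the components in this embodiment may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional module.
If the integrated unit is implemented in the form of a software functional module and is not sold or used as a standalone product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this embodiment, in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform all or part of the steps of the method described in this embodiment. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (Read Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, or an optical disc.
Accordingly, this embodiment provides a computer storage medium that stores a wake-up processing program, and the wake-up processing program, when executed by at least one processor, implements the steps of the method described in any one of the preceding embodiments.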
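Based on the composition of the wake-up processing apparatus 50 and the computer storage medium described above, refer to FIG. 6, which shows a schematic diagram of a specific hardware structure of the wake-up processing apparatus 50 provided by an embodiment of the present disclosure. As shown in FIG. 6, it may include a communication interface 601, a memory 602, and a processor 603, with the components coupled together through a bus system 604. It can be understood that the bus system 604 implements connection and communication among these components. In addition to a data bus, the bus system 604 includes a power bus, a control bus, and a status signal bus. For clarity, however, all of these buses are labeled as the bus system 604 in FIG. 6. Here, the communication interface 601 is used to receive and send signals while exchanging information with other external network elements;
the memory 602 is used to store a computer program that can run on the processor 603;
the processor 603 is used to perform the following when running the computer program:
acquiring audio to be recognized;
processing the audio to be recognized using a wake-up model and each of at least two sets of training data, to obtain at least two confidences and their respective confidence thresholds, wherein the at least two sets of training data are obtained by separately training on at least two wake-word training sets through the wake-up model;
triggering a wake-up event of the voice device according to the results of comparing the at least two confidences with their respective confidence thresholds.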
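It can be understood that the memory 602 in the embodiments of the present disclosure may be a volatile memory or a non-volatile memory, or may include both volatile and non-volatile memory. The non-volatile memory may be a read-only memory (Read-Only Memory, ROM), a programmable read-only memory (Programmable ROM, PROM), an erasable programmable read-only memory (Erasable PROM, EPROM), an electrically erasable programmable read-only memory (Electrically EPROM, EEPROM), or a flash memory. The volatile memory may be a random access memory (Random Access Memory, RAM), which is used as an external cache. By way of example and not limitation, many forms of RAM are available, such as static random access memory (Static RAM, SRAM), dynamic random access memory (Dynamic RAM, DRAM), synchronous dynamic random access memory (Synchronous DRAM, SDRAM), double data rate synchronous dynamic random access memory (Double Data Rate SDRAM, DDR SDRAM), enhanced synchronous dynamic random access memory (Enhanced SDRAM, ESDRAM), synchronous link dynamic random access memory (Synchronous link DRAM, SLDRAM), and direct Rambus random access memory (Direct Rambus RAM, DRRAM). The memory 602 of the systems and methods described herein is intended to include, without being limited to, these and any other suitable types of memory.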
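The processor 603 may be an integrated circuit chip with signal processing capability. In implementation, each step of the above method may be completed by an integrated logic circuit of hardware in the processor 603 or by instructions in the form of software. The processor 603 may be a general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field Programmable Gate Array, FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and it can implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of the present disclosure. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of the method disclosed in connection with the embodiments of the present disclosure may be carried out directly by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software module may reside in a storage medium that is mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 602; the processor 603 reads the information in the memory 602 and completes the steps of the above method in combination with its hardware.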
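It can be understood that the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or a combination thereof. For a hardware implementation, the processing unit may be implemented in one or more application-specific integrated circuits (Application Specific Integrated Circuits, ASIC), digital signal processors (Digital Signal Processor, DSP), digital signal processing devices (DSP Device, DSPD), programmable logic devices (Programmable Logic Device, PLD), field-programmable gate arrays (Field-Programmable Gate Array, FPGA), general-purpose processors, controllers, microcontrollers, microprocessors, other electronic units for performing the functions described in the present disclosure, or a combination thereof.
For a software implementation, the techniques described herein may be implemented through modules (for example, procedures and functions) that perform the functions described herein. Software code may be stored in a memory and executed by a processor. The memory may be implemented inside or outside the processor.
Optionally, as another embodiment, the processor 603 is further configured to perform, when running the computer program, the steps of the method described in any one of the preceding embodiments.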
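It should be noted that, in the present disclosure, the terms "include", "comprise", and any other variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or apparatus that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element qualified by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that includes the element.
The sequence numbers of the above embodiments of the present disclosure are for description only and do not indicate the relative merits of the embodiments.
The methods disclosed in the several method embodiments provided by the present disclosure may be combined arbitrarily, provided there is no conflict, to obtain new method embodiments.
The features disclosed in the several product embodiments provided by the present disclosure may be combined arbitrarily, provided there is no conflict, to obtain new product embodiments.
The features disclosed in the several method or device embodiments provided by the present disclosure may be combined arbitrarily, provided there is no conflict, to obtain new method embodiments or device embodiments.
The above are merely specific implementations of the present disclosure, but the scope of protection of the present disclosure is not limited thereto. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed herein shall fall within the scope of protection of the present disclosure. Therefore, the scope of protection of the present disclosure shall be subject to the scope of protection of the claims.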

Claims (13)

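  1. A wake-up processing method, applied to a voice device, the method comprising:
    acquiring audio to be recognized;
    processing the audio to be recognized using a wake-up model and each of at least two sets of training data, to obtain at least two confidences and their respective confidence thresholds, wherein the at least two sets of training data are obtained by separately training on at least two wake-word training sets through the wake-up model;
    triggering a wake-up event of the voice device according to the results of comparing the at least two confidences with their respective confidence thresholds.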
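  2. The method according to claim 1, wherein acquiring the audio to be recognized comprises:
    collecting data through a sound collection device to obtain initial voice data;
    preprocessing the initial voice data to obtain the audio to be recognized.
  3. The method according to claim 1, wherein each set of training data includes model parameters and a confidence threshold, and processing the audio to be recognized using the wake-up model and each of the at least two sets of training data, to obtain the at least two confidences and their respective confidence thresholds, comprises:
    processing the audio to be recognized using the wake-up model and the model parameters in each of the at least two sets of training data, to obtain at least two confidences, and obtaining from the at least two sets of training data the confidence threshold corresponding to each of the at least two confidences.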
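  4. The method according to any one of claims 1 to 3, wherein the at least two sets of training data include a first set of training data and a second set of training data, the first set of training data including first model parameters and a first confidence threshold, and the second set of training data including second model parameters and a second confidence threshold;
    processing the audio to be recognized using the wake-up model and each of the at least two sets of training data, to obtain the at least two confidences and their respective confidence thresholds, comprises:
    processing the audio to be recognized using the wake-up model and the first model parameters in the first set of training data to obtain a first confidence, and determining from the first set of training data the first confidence threshold corresponding to the first confidence; and
    processing the audio to be recognized using the wake-up model and the second model parameters in the second set of training data to obtain a second confidence, and determining from the second set of training data the second confidence threshold corresponding to the second confidence.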
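  5. The method according to claim 4, wherein triggering the wake-up event of the voice device according to the results of comparing the at least two confidences with their respective confidence thresholds comprises:
    triggering the wake-up event of the voice device if the first confidence is greater than or equal to the first confidence threshold, or the second confidence is greater than or equal to the second confidence threshold.
  6. The method according to claim 5, wherein the wake-up event includes a first wake-up event and/or a second wake-up event, the first wake-up event being associated with the wake word corresponding to the first set of training data, and the second wake-up event being associated with the wake word corresponding to the second set of training data.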
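  7. The method according to claim 6, wherein triggering the wake-up event of the voice device according to the results of comparing the at least two confidences with their respective confidence thresholds comprises:
    triggering the first wake-up event of the voice device if the first confidence is greater than or equal to the first confidence threshold and the second confidence is less than the second confidence threshold; or
    triggering the second wake-up event of the voice device if the second confidence is greater than or equal to the second confidence threshold and the first confidence is less than the first confidence threshold; or
    if the first confidence is greater than or equal to the first confidence threshold and the second confidence is greater than or equal to the second confidence threshold, calculating a first value by which the first confidence exceeds the first confidence threshold and a second value by which the second confidence exceeds the second confidence threshold, and triggering a target wake-up event of the voice device according to the first value and the second value.
  8. The method according to claim 7, wherein triggering the target wake-up event of the voice device according to the first value and the second value comprises:
    determining that the target wake-up event is the first wake-up event and triggering it if the first value is greater than or equal to the second value; or
    determining that the target wake-up event is the second wake-up event and triggering it if the first value is less than the second value.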
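  9. The method according to claim 1, wherein the method further comprises:
    acquiring the at least two wake-word training sets;
    training the wake-up model using the at least two wake-word training sets to obtain the at least two sets of training data, wherein each set of training data includes model parameters and a confidence threshold.
  10. The method according to claim 9, wherein acquiring the at least two wake-word training sets comprises:
    acquiring an initial training set, wherein the initial training set includes at least two wake words;
    grouping the initial training set by wake word to obtain the at least two wake-word training sets.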
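  11. A wake-up processing apparatus, applied to a voice device, the wake-up processing apparatus comprising an acquisition unit, a processing unit, and a trigger unit, wherein:
    the acquisition unit is configured to acquire audio to be recognized;
    the processing unit is configured to process the audio to be recognized using a wake-up model and each of at least two sets of training data, to obtain at least two confidences and their respective confidence thresholds, wherein the at least two sets of training data are obtained by separately training on at least two wake-word training sets through the wake-up model;
    the trigger unit is configured to trigger a wake-up event of the voice device according to the results of comparing the at least two confidences with their respective confidence thresholds.
  12. A voice device, the voice device comprising a memory and a processor, wherein:
    the memory is used to store a computer program that can run on the processor;
    the processor is used to perform the method according to any one of claims 1 to 10 when running the computer program.
  13. A computer storage medium, the computer storage medium storing a computer program that, when executed by at least one processor, implements the method according to any one of claims 1 to 10.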
PCT/CN2022/082571 2021-08-06 2022-03-23 一种唤醒处理方法、装置、设备和计算机存储介质 Ceased WO2023010861A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
EP22851586.2A EP4383250A4 (en) 2021-08-06 2022-03-23 WAKE-UP METHOD, DEVICE, APPARATUS AND COMPUTER STORAGE MEDIUM
JP2024531560A JP7743630B2 (ja) 2021-08-06 2022-03-23 ウェイクアップ処理方法、装置、設備及びコンピュータ記憶媒体
US18/431,630 US20240177707A1 (en) 2021-08-06 2024-02-02 Wake-up processing method and device, voice apparatus, and computer-readable storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110904169.X 2021-08-06
CN202110904169.XA CN113782016B (zh) 2021-08-06 2021-08-06 一种唤醒处理方法、装置、设备和计算机存储介质

Publications (1)

Publication Number Publication Date
WO2023010861A1 true WO2023010861A1 (zh) 2023-02-09

Family

ID=78837028

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/082571 Ceased WO2023010861A1 (zh) 2021-08-06 2022-03-23 一种唤醒处理方法、装置、设备和计算机存储介质

Country Status (5)

Country Link
US (1) US20240177707A1 (zh)
EP (1) EP4383250A4 (zh)
JP (1) JP7743630B2 (zh)
CN (1) CN113782016B (zh)
WO (1) WO2023010861A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116052667A (zh) * 2023-03-08 2023-05-02 广东浩博特科技股份有限公司 智能开关的控制方法、装置和智能开关

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113782016B (zh) * 2021-08-06 2023-05-05 佛山市顺德区美的电子科技有限公司 一种唤醒处理方法、装置、设备和计算机存储介质
CN116564296A (zh) * 2022-01-28 2023-08-08 博泰车联网(南京)有限公司 语音识别方法、装置及电子设备
CN116416976A (zh) * 2023-04-24 2023-07-11 思必驰科技股份有限公司 唤醒识别一体化模型的训练方法及电子设备和存储介质
CN119724169A (zh) * 2023-09-27 2025-03-28 北京小米移动软件有限公司 设备控制方法、装置、设备与可读存储介质

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103500579A (zh) * 2013-10-10 2014-01-08 中国联合网络通信集团有限公司 语音识别方法、装置及系统
WO2017084360A1 (zh) * 2015-11-17 2017-05-26 乐视控股(北京)有限公司 一种用于语音识别方法及系统
CN107871506A (zh) * 2017-11-15 2018-04-03 北京云知声信息技术有限公司 语音识别功能的唤醒方法及装置
CN109686370A (zh) * 2018-12-24 2019-04-26 苏州思必驰信息科技有限公司 基于语音控制进行斗地主游戏的方法及装置
CN110111775A (zh) * 2019-05-17 2019-08-09 腾讯科技(深圳)有限公司 一种流式语音识别方法、装置、设备及存储介质
CN112489648A (zh) * 2020-11-25 2021-03-12 广东美的制冷设备有限公司 唤醒处理阈值调整方法、语音家电、存储介质
CN113782016A (zh) * 2021-08-06 2021-12-10 佛山市顺德区美的电子科技有限公司 一种唤醒处理方法、装置、设备和计算机存储介质

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2468203B (en) * 2009-02-27 2011-07-20 Autonomy Corp Ltd Various apparatus and methods for a speech recognition system
US10504511B2 (en) * 2017-07-24 2019-12-10 Midea Group Co., Ltd. Customizable wake-up voice commands
US10825451B1 (en) * 2018-06-25 2020-11-03 Amazon Technologies, Inc. Wakeword detection
EP4280579B1 (en) * 2018-08-09 2025-12-31 Google LLC MOTCLÉ TRIGGER RECOGNITION AND PASSIVE ASSISTANCE
JP7058574B2 (ja) * 2018-09-10 2022-04-22 ヤフー株式会社 情報処理装置、情報処理方法、およびプログラム
CN109036412A (zh) * 2018-09-17 2018-12-18 苏州奇梦者网络科技有限公司 语音唤醒方法和系统
CN110310628B (zh) * 2019-06-27 2022-05-20 百度在线网络技术(北京)有限公司 唤醒模型的优化方法、装置、设备及存储介质
CN110534099B (zh) * 2019-09-03 2021-12-14 腾讯科技(深圳)有限公司 语音唤醒处理方法、装置、存储介质及电子设备
CN111243604B (zh) * 2020-01-13 2022-05-10 思必驰科技股份有限公司 支持多唤醒词的说话人识别神经网络模型的训练方法、说话人识别方法及系统
CN111667818B (zh) * 2020-05-27 2023-10-10 北京声智科技有限公司 一种训练唤醒模型的方法及装置
CN112259085A (zh) * 2020-09-28 2021-01-22 上海声瀚信息科技有限公司 一种基于模型融合框架的两阶段语音唤醒算法

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4383250A4 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116052667A (zh) * 2023-03-08 2023-05-02 广东浩博特科技股份有限公司 智能开关的控制方法、装置和智能开关
CN116052667B (zh) * 2023-03-08 2023-06-16 广东浩博特科技股份有限公司 智能开关的控制方法、装置和智能开关

Also Published As

Publication number Publication date
JP2024528331A (ja) 2024-07-26
CN113782016A (zh) 2021-12-10
JP7743630B2 (ja) 2025-09-24
US20240177707A1 (en) 2024-05-30
EP4383250A1 (en) 2024-06-12
CN113782016B (zh) 2023-05-05
EP4383250A4 (en) 2024-10-16

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22851586

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2024531560

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2022851586

Country of ref document: EP

Effective date: 20240306