WO2020238341A1 - Method, apparatus, device, and computer-readable storage medium for speech recognition
- Publication number
- WO2020238341A1 (PCT/CN2020/079522)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- language model
- target language
- dynamic target
- intention
- vocabulary
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/1822—Parsing for meaning understanding
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/183—Speech classification or search using natural language modelling using context dependencies, e.g. language models
- G10L15/187—Phonemic context, e.g. pronunciation rules, phonotactical constraints or phoneme n-grams
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/183—Speech classification or search using natural language modelling using context dependencies, e.g. language models
- G10L15/19—Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules
- G10L15/193—Formal grammars, e.g. finite state automata, context free grammars or word networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/04—Segmentation; Word boundary detection
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/065—Adaptation
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/065—Adaptation
- G10L15/07—Adaptation to the speaker
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L2015/088—Word spotting
Definitions
- This application relates to the field of artificial intelligence technology, in particular to methods, devices, equipment and computer-readable storage media for speech recognition.
- Related technologies provide a speech recognition method. After calling a language model to recognize a voice command and understand the user's instruction, the method on the one hand asks the user a question, and on the other hand adjusts the language model according to the question, for example by integrating a vocabulary set related to the question into the language model, so that the adjusted language model can recognize the words in that vocabulary set.
- When the user replies using words from the vocabulary set, the adjusted language model can recognize the reply voice and meet the user's needs.
- However, users may also utter irrelevant speech, such as speech directed at third parties.
- For example, in a typical multi-user, multi-scenario setting such as a user interacting with the vehicle-mounted module of a car or an electric vehicle, irrelevant speech is likely to include conversation between the user and other users, or speech interjected by other users.
- The voice recognition system of the vehicle-mounted module will also recognize and understand this irrelevant speech as a voice command or a reply voice, so the provided service deviates from the user's needs and the user experience is poor.
- the embodiments of the present application provide a voice recognition method, device, equipment, and computer-readable storage medium to overcome the problems of poor recognition effect and poor user experience in related technologies.
- this application provides a voice recognition method, including:
- A method for speech recognition includes: obtaining or generating a dynamic target language model according to reply information to a first intention, where the dynamic target language model includes a front-end part and a core part, the core part is used to determine possible descriptions related to the reply information, and the front-end part is used to determine descriptions of confirmatory information for the reply information; acquiring a voice signal and parsing the voice signal to generate keywords; and calling the dynamic target language model to determine a second intention and service content, where the front-end part of the dynamic target language model parses out the second intention according to the keywords, and the core part of the dynamic target language model parses out the service content according to the keywords.
- the first intention includes the intention obtained by parsing the voice signal of the user after the voice dialogue between the user and the vehicle-mounted module starts.
- the reply information of the first intention includes one or more reply information that the vehicle-mounted module returns to the user for the first intention, and the vehicle-mounted module obtains a dynamic target language model including the front-end part and the core part according to the reply information to the first intention.
- After the vehicle-mounted module returns one or more items of reply information to the user, the vehicle-mounted module acquires a voice signal again.
- the voice signal acquired by the vehicle-mounted module again may include the voice signal of the dialogue between the user and the vehicle-mounted module, that is, the voice signal related to the reply message, and the irrelevant voice signal of the dialogue between the user and other users.
- the vehicle-mounted module parses the acquired voice signal to generate keywords, calls the dynamic target language model, and parses the part of the vocabulary related to the reply message from the generated keywords.
- the dynamic target language model includes a front-end part and a core part.
- the front-end part is used to determine the user's description of the confirmation information of the reply message.
- the confirmation information can include confirmation, correction, and cancellation.
- The user's second intention can be obtained by parsing the keywords through the front-end part. For example, if there is one item of reply information to the first intention, and the confirmation information obtained by the front-end part from the keywords is the confirmation "right, yes", it can be confirmed that the user's second intention is the intention indicated by the reply information to the first intention.
- the core part is used to determine the possible descriptions related to the reply information.
- The vocabulary used by the user to describe the reply information can be parsed from the keywords through the core part, so that the service content is obtained based on that vocabulary and the corresponding service is provided to the user.
- The service indicated by the service content may be provided by a third-party cloud service, by the vehicle-mounted module, by the vehicle-mounted terminal, or by the vehicle enterprise.
- the vehicle-mounted terminal may be other terminals on the vehicle other than the vehicle-mounted module, such as a vehicle-mounted display screen, a vehicle-mounted air conditioner, and a vehicle-mounted speaker.
- Since the front-end part and the core part of the dynamic target language model are obtained based on the reply information, the second intention obtained through the front-end part and the service content obtained through the core part are both related to the first intention, and voice signals unrelated to the first intention are ignored. Therefore, the embodiments of this application achieve a better voice recognition effect, avoid the provided service deviating from user needs because of interference from irrelevant voice signals, and improve the user experience.
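The division of labor between the front-end part and the core part can be illustrated with a minimal sketch. It uses plain dictionary and set lookups as a stand-in for the dynamic target language model, which is not how the application builds the model. The confirmation words in `FRONT_END`, the helper names, and the sample keywords are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Set

# Hypothetical confirmation vocabulary. The source only names the three kinds
# of confirmation information: confirmation, correction, and cancellation.
FRONT_END = {
    "yes": "confirm",
    "right": "confirm",
    "no": "correct",
    "cancel": "cancel",
}


@dataclass
class ParseResult:
    confirmation: Optional[str]  # what the front-end part found, if anything
    service_words: List[str]     # words the core part matched against the reply info


def build_core_vocabulary(reply_items: List[str]) -> Set[str]:
    """Collect the words that appear in the reply information (assumed whitespace tokens)."""
    return {word.lower() for item in reply_items for word in item.split()}


def parse_keywords(keywords: List[str], core_vocabulary: Set[str]) -> ParseResult:
    confirmation = None
    service_words = []
    for word in keywords:
        token = word.lower()
        if token in FRONT_END:
            if confirmation is None:
                confirmation = FRONT_END[token]
        elif token in core_vocabulary:
            service_words.append(token)
        # Any other keyword is treated as unrelated to the first intention and ignored.
    return ParseResult(confirmation, service_words)


core = build_core_vocabulary(["Sichuan Restaurant A"])
result = parse_keywords(["turn", "left", "here", "yes", "Sichuan", "Restaurant"], core)
print(result)
```

In this sketch the words "turn", "left", and "here" match neither part, which mirrors how speech unrelated to the first intention is dropped.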
- The dynamic target language model further includes a tailing part for confirming whether there is an additional intention. The method further includes: invoking the dynamic target language model to determine the additional intention, where the tailing part of the dynamic target language model parses out the additional intention according to the keywords.
- The tailing part includes tailing marker words. The invoking the dynamic target language model to determine the additional intention, where the tailing part of the dynamic target language model parses out the additional intention according to the keywords, includes: parsing out, by the tailing part according to the keywords, a reference tailing marker word and the time point at which the reference tailing marker word is located; updating the dynamic target language model based on the reference tailing marker word, in combination with the first intention and the second intention, to obtain an updated target language model; and invoking the updated target language model to parse out the additional intention according to the keywords and the time point at which the reference tailing marker word is located.
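One way to picture the role of the time point is the sketch below. It assumes that the keywords carrying the additional intention are those spoken after the tailing marker word, which the text does not state explicitly. The marker words in `TAIL_MARKERS` and the function name are hypothetical, and updating the language model is not modeled.

```python
from typing import List, Optional, Tuple

# Hypothetical tailing marker words; the source does not list any.
TAIL_MARKERS = {"also", "additionally"}


def split_at_tail_marker(
    timed_keywords: List[Tuple[str, float]]
) -> Tuple[Optional[str], Optional[float], List[str]]:
    """timed_keywords holds (word, time_in_seconds) pairs sorted by time.

    Returns the first marker word found, its time point, and the keywords
    spoken after that time point. Returns (None, None, []) if no marker occurs.
    """
    for word, time_point in timed_keywords:
        if word.lower() in TAIL_MARKERS:
            later = [w for w, t in timed_keywords if t > time_point]
            return word.lower(), time_point, later
    return None, None, []


marker, marker_time, tail_keywords = split_at_tail_marker(
    [("yes", 0.2), ("also", 1.0), ("play", 1.4), ("music", 1.8)]
)
print(marker, marker_time, tail_keywords)
```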
- Before the acquiring the voice signal, the method further includes: buffering a historical voice signal. The parsing the voice signal to generate keywords includes: parsing the voice signal, and generating the keywords after performing context detection using the historical voice signal.
- Context detection through historical voice signals can make the recognized keywords more suitable for the current scene, thereby further improving the accuracy of voice recognition.
- the method further includes: confirming the second intention, and obtaining the confirmed second intention.
- The confirming the second intention and obtaining the confirmed second intention includes: sending confirmation information for the second intention to the user, obtaining the second intention fed back by the user, and using the second intention fed back by the user as the confirmed second intention.
- Confirming the second intention makes it more accurate, so that more accurate service content can be provided.
- The obtaining or generating the dynamic target language model according to the reply information to the first intention includes: converting the reply information to the first intention into a reference format to obtain reply information in the reference format, and obtaining or generating the dynamic target language model according to the reply information in the reference format. Since different suppliers may provide reply information in different formats, converting the reply information into a reference format unifies its format and makes it easier to receive. For different application fields, the reply information is converted into different reference formats, so that the reply information within the same application field has the same format.
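A minimal sketch of such a conversion step follows. The supplier names, the field names on both sides, and the shape of the reference format are all hypothetical, since the text does not define them.

```python
from typing import Dict


def to_reference_format(raw: Dict[str, str], supplier: str) -> Dict[str, str]:
    """Map a supplier-specific reply record onto one shared layout.

    Both supplier layouts and the target keys ("name", "category") are
    illustrative assumptions, not formats described in the source.
    """
    if supplier == "supplier_a":
        return {"name": raw["title"], "category": raw["type"]}
    if supplier == "supplier_b":
        return {"name": raw["poi_name"], "category": raw["kind"]}
    raise ValueError(f"unknown supplier: {supplier}")


item_a = to_reference_format({"title": "Sichuan Restaurant A", "type": "restaurant"}, "supplier_a")
item_b = to_reference_format({"poi_name": "Sichuan Restaurant A", "kind": "restaurant"}, "supplier_b")
print(item_a == item_b)
```

Once every reply item shares one layout, the later model-building steps only need to handle a single format.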
- The obtaining or generating the dynamic target language model according to the reply information in the reference format includes: converting a trained language model into a weighted finite state transition automaton model, and using the weighted finite state transition automaton model as the dynamic target language model, where the trained language model is obtained by training on the reply information in the reference format and a reference vocabulary.
- the reference vocabulary includes, but is not limited to, the category name corresponding to the vocabulary in the reply message in the reference format, and referential expression words.
- The obtaining or generating the dynamic target language model according to the reply information in the reference format includes: converting a trained language model into a weighted finite state transition automaton model, and using the weighted finite state transition automaton model as a first language model, where the trained language model is obtained by training on the reply information in the reference format whose length is not less than a reference length; obtaining a second language model according to the reply information in the reference format whose length is less than the reference length, and obtaining a third language model according to the reference vocabulary; and merging the first language model, the second language model, and the third language model to obtain a total language model, and using the total language model as the dynamic target language model.
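The split by reference length can be sketched as follows. The sketch assumes that length is counted in whitespace-separated words, which the text does not specify, and it only partitions the reply items. Training and merging the language models are outside its scope.

```python
from typing import List, Tuple


def partition_replies(replies: List[str], reference_length: int) -> Tuple[List[str], List[str]]:
    """Split reply items into those at or above the reference length and those below it.

    Length is measured in whitespace-separated words, which is an assumption.
    """
    long_items = [r for r in replies if len(r.split()) >= reference_length]
    short_items = [r for r in replies if len(r.split()) < reference_length]
    return long_items, short_items


long_items, short_items = partition_replies(
    ["Sichuan Restaurant A", "Noodle Bar", "Cafe"], reference_length=2
)
print(long_items, short_items)
```

Under this reading, the long items would feed the first language model and the short items the second.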
- The obtaining or generating the dynamic target language model according to the reply information in the reference format includes: obtaining a word confusion network based on the reply information in the reference format whose length is not less than the reference length, where each word in the word confusion network has a transition probability; calculating a penalty weight for each word, converting the word confusion network into a weighted finite state transition automaton model according to the penalty weight of each word, and using the weighted finite state transition automaton model as the first language model; obtaining a second language model according to the reply information in the reference format whose length is less than the reference length, and obtaining a third language model according to the reference vocabulary; and merging the first language model, the second language model, and the third language model to obtain a total language model, and using the total language model as the dynamic target language model.
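The conversion from a word confusion network to a weighted automaton can be sketched in a simplified form. The sketch assumes the network is a linear chain of slots, each mapping candidate words to transition probabilities, and it uses the negative logarithm as the penalty weight. The tuple layout of the arcs and the function names are illustrative and do not come from any WFST library.

```python
import math
from typing import Dict, List, Optional, Tuple

Arc = Tuple[int, int, str, float]  # (source state, destination state, word, penalty weight)


def confusion_network_to_arcs(slots: List[Dict[str, float]]) -> List[Arc]:
    """Turn each slot of the network into arcs between consecutive states."""
    arcs = []
    for state, slot in enumerate(slots):
        for word, probability in slot.items():
            arcs.append((state, state + 1, word, -math.log(probability)))
    return arcs


def path_cost(arcs: List[Arc], words: List[str]) -> Optional[float]:
    """Sum the penalty weights along states 0, 1, 2, ...; None if a word has no arc."""
    table = {(src, word): weight for src, _dst, word, weight in arcs}
    total = 0.0
    for state, word in enumerate(words):
        if (state, word) not in table:
            return None
        total += table[(state, word)]
    return total


slots = [{"sichuan": 0.5, "hunan": 0.5}, {"restaurant": 1.0}]
arcs = confusion_network_to_arcs(slots)
cost = path_cost(arcs, ["sichuan", "restaurant"])
print(cost)
```

A word sequence with no matching arcs yields `None`, which is one way to represent speech that the model does not accept.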
- the calculating the penalty weight of each vocabulary includes: for any vocabulary, using a negative logarithm of the transition probability of the vocabulary as the penalty weight.
- The transition probability of a word indicates the frequency of occurrence of the word within the category to which the word belongs. The higher the frequency of occurrence of the word in the category, the greater the transition probability and the smaller its negative logarithm, that is, the smaller the penalty weight. The penalty weight is therefore negatively correlated with the frequency of occurrence, so that the target language model can better parse the words that appear frequently in the category.
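As a worked illustration, the sketch below assumes the transition probability of a word is its relative frequency within its category, so that the penalty weight is $-\log p$. The function name and the sample words are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def penalty_weights(category_words: List[str]) -> Dict[str, float]:
    """Penalty weight = negative log of the word's relative frequency in its category."""
    counts = Counter(category_words)
    total = sum(counts.values())
    return {word: -math.log(count / total) for word, count in counts.items()}


# "sichuan" appears 3 times out of 4 in this category, "hunan" once.
weights = penalty_weights(["sichuan", "sichuan", "sichuan", "hunan"])
print(weights)
```

The more frequent word receives the smaller penalty, which matches the relationship described above.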
- The calculating the penalty weight of each word includes: for any word, using the logarithm of the number of items of reply information in the reference format that contain the word as the penalty weight.
- More discriminative words, that is, words contained in fewer items of reply information in the reference format, are given a smaller penalty weight, so that the target language model can better parse these discriminative words.
- The calculating the penalty weight of each word includes: for any word, using the logarithm of the number of occurrences of the word in the reply information in the reference format as the penalty weight. For strongly discriminative words, that is, words with fewer occurrences, the penalty weight is smaller, so that the dynamic target language model can better parse them.
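The two count-based alternatives can be sketched side by side. The sketch assumes whitespace tokenization, case-insensitive matching, and the natural logarithm, none of which the text specifies. The function names are hypothetical.

```python
import math
from typing import List, Optional


def weight_by_item_count(word: str, replies: List[str]) -> Optional[float]:
    """Log of the number of reply items that contain the word (None if it never appears)."""
    n = sum(1 for reply in replies if word in reply.lower().split())
    return math.log(n) if n else None


def weight_by_occurrences(word: str, replies: List[str]) -> Optional[float]:
    """Log of the total number of times the word occurs across all reply items."""
    n = sum(reply.lower().split().count(word) for reply in replies)
    return math.log(n) if n else None


replies = ["Sichuan Restaurant A", "Sichuan Restaurant B", "Hunan Restaurant C"]
for w in ("hunan", "sichuan", "restaurant"):
    print(w, weight_by_item_count(w, replies), weight_by_occurrences(w, replies))
```

In this sample, "hunan" appears in a single item and gets the smallest weight, while "restaurant" appears in every item and gets the largest.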
- A device for speech recognition includes: a first acquisition module, configured to obtain or generate a dynamic target language model according to reply information to a first intention, where the dynamic target language model includes a front-end part and a core part, the core part is used to determine possible descriptions related to the reply information, and the front-end part is used to determine descriptions of confirmatory information for the reply information; a second acquisition module, configured to acquire a voice signal and parse the voice signal to generate keywords; and a first determining module, configured to call the dynamic target language model to determine a second intention and service content, where the front-end part of the dynamic target language model parses out the second intention according to the keywords, and the core part of the dynamic target language model parses out the service content according to the keywords.
- The dynamic target language model further includes a tailing part for confirming whether there is an additional intention. The device further includes: a second determining module, configured to call the dynamic target language model to determine the additional intention, where the tailing part of the dynamic target language model parses out the additional intention according to the keywords.
- The tailing part includes tailing marker words. The second determining module is configured to: parse out, through the tailing part according to the keywords, a reference tailing marker word and the time point at which the reference tailing marker word is located; update the dynamic target language model based on the reference tailing marker word, in combination with the first intention and the second intention, to obtain an updated target language model; and call the updated target language model to parse out the additional intention according to the keywords and the time point at which the reference tailing marker word is located.
- The device further includes: a cache module, configured to cache historical voice signals. The second acquisition module is configured to parse the voice signal and generate the keywords after performing context detection using the historical voice signals.
- the device further includes: a confirmation module, configured to confirm the second intention, and obtain the confirmed second intention.
- the confirmation module is configured to send confirmation information of the second intention to the user, obtain the second intention fed back by the user, and use the second intention fed back by the user as the confirmed second intention.
- The first acquisition module is configured to convert the reply information to the first intention into a reference format to obtain reply information in the reference format, and to obtain or generate the dynamic target language model according to the reply information in the reference format.
- The first acquisition module is configured to convert a trained language model into a weighted finite state transition automaton model, and use the weighted finite state transition automaton model as the dynamic target language model, where the trained language model is obtained by training on the reply information in the reference format and a reference vocabulary.
- The first acquisition module is configured to: convert a trained language model into a weighted finite state transition automaton model, and use the weighted finite state transition automaton model as a first language model, where the trained language model is obtained by training on the reply information in the reference format whose length is not less than a reference length; obtain a second language model according to the reply information in the reference format whose length is less than the reference length, and obtain a third language model according to the reference vocabulary; and merge the first language model, the second language model, and the third language model to obtain a total language model, and use the total language model as the dynamic target language model.
- The first acquisition module includes: a first acquiring unit, configured to obtain a word confusion network based on the reply information in the reference format whose length is not less than the reference length, where each word in the word confusion network has a transition probability; a calculation unit, configured to calculate a penalty weight for each word, convert the word confusion network into a weighted finite state transition automaton model according to the penalty weight of each word, and use the weighted finite state transition automaton model as the first language model; a second acquiring unit, configured to obtain a second language model according to the reply information in the reference format whose length is less than the reference length, and obtain a third language model according to the reference vocabulary; and a merging unit, configured to merge the first language model, the second language model, and the third language model to obtain a total language model, and use the total language model as the dynamic target language model.
- the calculation unit is configured to, for any vocabulary, use the negative logarithm of the transition probability of the vocabulary as the penalty weight.
- the calculation unit is configured to use, for any vocabulary, a logarithmic value of the number of reply messages in a reference format containing the vocabulary as the penalty weight.
- the calculation unit is configured to use, for any word, a logarithmic value of the number of occurrences of the word in the reply message in the reference format as the penalty weight.
- A device for speech recognition includes: a memory and a processor; the memory stores at least one instruction, and the at least one instruction is loaded and executed by the processor to implement the method in the first aspect or any possible implementation of the first aspect of the embodiments of this application.
- There are one or more processors and one or more memories.
- the memory may be integrated with the processor, or the memory and the processor may be provided separately.
- The memory can be a non-transitory memory, such as a read-only memory (ROM), which can be integrated with the processor on the same chip or set on a different chip. The embodiments of this application do not limit the type of memory or the manner in which the memory and the processor are arranged.
- a computer-readable storage medium stores a program or instruction, and the instruction is loaded by a processor and executes any of the above-mentioned voice recognition methods.
- a computer program (product) is also provided, the computer program (product) comprising: computer program code, when the computer program code is run by a computer, the computer executes any of the above-mentioned voice recognition methods.
- A chip is provided, including a processor configured to call, from a memory, and execute instructions stored in the memory, so that a communication device in which the chip is installed executes any of the above-mentioned voice recognition methods.
- Another chip is provided, including: an input interface, an output interface, a processor, and a memory, where the input interface, the output interface, the processor, and the memory are connected by an internal connection path, and the processor is configured to execute code in the memory; when the code is executed, the processor executes any of the above-mentioned voice recognition methods.
- The embodiments of this application obtain or generate a dynamic target language model including a front-end part and a core part according to the reply information to the first intention. After the voice signal is parsed to obtain keywords, the dynamic target language model is called to parse the keywords and obtain the second intention and service content. Since the dynamic target language model is obtained based on the reply information to the first intention, the second intention and service content obtained through the dynamic target language model are both related to the first intention. Therefore, the embodiments of this application ignore speech that is unrelated to the first intention, that is, they can recognize discontinuous multi-intent speech. This avoids the provided service content deviating from user needs, gives a good recognition effect, and improves the user experience.
- FIG. 1 is a schematic diagram of an implementation environment provided by an embodiment of this application.
- FIG. 2 is a block diagram of a method for implementing speech recognition provided by an embodiment of this application.
- FIG. 3 is a flowchart of a voice recognition method provided by an embodiment of this application.
- FIG. 4 is a schematic diagram of the structure of a language model provided by an embodiment of the application.
- FIG. 5 is a schematic diagram of a process of speech recognition provided by an embodiment of this application.
- FIG. 6 is a schematic structural diagram of a language model provided by an embodiment of this application.
- FIG. 7 is a schematic structural diagram of a language model provided by an embodiment of the application.
- FIG. 8 is a schematic structural diagram of a word confusion network provided by an embodiment of this application.
- FIG. 9 is a schematic structural diagram of a speech recognition device provided by an embodiment of this application.
- Related technologies provide a speech recognition method. After calling a language model to recognize a voice command and understand the user's instruction, the method on the one hand asks the user a question, and on the other hand adjusts the language model according to the question, for example by integrating a vocabulary set related to the question into the language model, so that the adjusted language model can recognize the words in that vocabulary set.
- When the user replies using words from the vocabulary set, the adjusted language model can recognize the reply voice and meet the user's needs.
- the user's voice is often more flexible.
- The user and the vehicle-mounted module may have the following voice dialogue:
- The voice recognition system of the vehicle-mounted module asks a question based on the voice command, and then integrates the vocabulary "Sichuan Restaurant A" involved in the question into the language model to obtain an adjusted language model. Afterwards, if the user uses "Sichuan Restaurant A" to utter the reply voice "Yes, it is Sichuan Restaurant A", the adjusted language model can recognize the reply voice.
- However, in the dialogue the user first utters irrelevant speech directed at other users in the car, so the adjusted language model recognizes this irrelevant speech as the reply voice, which leads to a misunderstanding. It can be seen that the speech recognition method provided by the related technology has a poor recognition effect and a poor user experience.
- the embodiment of the present application provides a method for speech recognition, which can be applied in the implementation environment as shown in FIG. 1.
- the audio equipment includes a microphone array and a speaker
- the memory stores programs or instructions of a module for voice recognition.
- The audio equipment, the memory, and the CPU are connected for communication through a data bus (D-Bus), so that the CPU calls the microphone array to collect the voice signal uttered by the user, runs the programs or instructions of each module stored in the memory based on the collected voice signal, and calls the speaker to send a voice signal to the user according to the running result.
- the CPU may also access the cloud service through a gateway to obtain data returned by the cloud service.
- the CPU can also access the controller area network bus (CAN-Bus) through the gateway to read and control the status of other devices.
- The programs or instructions of the modules for voice recognition stored in the memory include programs or instructions such as the ring voice buffer module, AM module, SL module, Dynamic LM module, SLU module, DM module, and NCM process, which are executed by the CPU in FIG. 1 to implement voice recognition.
- the process of voice recognition will be described:
- Front-end speech module: used to distinguish the voice signal sent by the user from non-speech signals such as road noise and music, and to denoise and enhance the user's voice signal so as to improve the accuracy of subsequent recognition and understanding.
- Ring voice buffer module: used to buffer the voice signal processed by the front-end speech module so that the stored voice signal can be recognized and understood multiple times.
- The ring voice buffer has a reference time length. When the duration of the buffered voice signal exceeds the reference time length, the voice signal that has been stored the longest is overwritten by the new voice signal.
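The overwrite behaviour of the ring voice buffer can be sketched as follows. This is a minimal illustration, not the patent's implementation: the class name `RingVoiceBuffer`, the frame-based granularity, and the parameter names are assumptions introduced here.

```python
from collections import deque


class RingVoiceBuffer:
    """Hypothetical buffer holding at most `reference_seconds` of audio frames."""

    def __init__(self, reference_seconds: float, frame_seconds: float):
        # Number of frames that fit into the reference time length.
        self.capacity = max(1, int(round(reference_seconds / frame_seconds)))
        # A deque with maxlen drops the oldest frame when a new one arrives.
        self.frames = deque(maxlen=self.capacity)

    def write(self, frame: bytes) -> None:
        self.frames.append(frame)

    def read_all(self) -> list:
        # Reading does not consume frames, so the same signal can be
        # recognized and understood more than once.
        return list(self.frames)


buf = RingVoiceBuffer(reference_seconds=0.3, frame_seconds=0.1)
for i in range(5):
    buf.write(bytes([i]))  # frames 0 and 1 are overwritten by later frames
```

Using a bounded `deque` keeps the sketch short; a fixed array with a moving write index would behave the same way.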
- Acoustic model (AM) module: used to obtain the voice signal stored in the ring voice buffer module and convert it into a phoneme sequence.
- Selective listening (SL) module: used to call the dynamic language model (Dynamic LM) module, convert the phoneme sequence output by the AM module into keywords, and send the keywords to the spoken language understanding (SLU) module.
- SLU module: used to extract intentions and semantic slots from keywords to understand the first intention, second intention, and additional intentions indicated by the user's voice signal.
- Dialogue manager (DM) module: used to request reply information from the cloud service according to the first intention.
- Application manager (APP Manager) module: used to convert the reply information returned by the cloud service into reply information in a reference format.
- Dialogue manager (DM) module: also used to start non-continuous multi-intent (NCM) processes in the related field according to the reply information in the reference format returned by the APP Manager module, and to control the response generator (RG) module to generate reply content and perform voice playback. It is also used to send instructions to the APP Manager module according to the second intention and the additional intention, so as to control the application or terminal device to execute the service content and the additional intention.
- Application manager (APP Manager) module: also used to perform word segmentation and proper noun labeling on reply information, and to manage applications and terminal devices according to instructions sent by the DM module, so as to control them to execute the service content and the additional intention.
- an embodiment of the present application provides a voice recognition method. As shown in Figure 3, the method includes:
- Step 201 Acquire or generate a dynamic target language model according to the response information to the first intention.
- the dynamic target language model includes a front-end part and a core part.
- The core part is used to determine possible descriptions related to the response information.
- The front-end part is used to determine descriptions of confirmatory information for the response information.
- the first intention refers to the intention obtained by analyzing the user's voice command signal after the voice dialogue between the user and the system starts.
- the user's voice command signal is the voice of "Help me find a nearby Sichuan restaurant" issued by the user.
- Analyzing the voice command signal includes: calling an acoustic model to convert the voice command signal into a sequence of phonemes.
- a phoneme is the smallest phonetic unit of a language. For example, in Chinese, a phoneme refers to a consonant or a final.
- the language model is called to convert the phoneme sequence into a text sequence, and the text sequence is a voice command.
- the language model refers to a language model that has been trained according to the training set, and an appropriate language model can be called according to the application domain of speech recognition.
- the text sequence can be parsed to obtain the first intention.
- the first intention includes intention and semantic slot
- A semantic slot refers to a word with a clear definition or concept in a text sequence. Still taking the above voice dialogue as an example, if the text sequence is "Help me find a nearby Sichuan restaurant", the parsed intention is "navigation", the semantic slots are "nearby" and "Sichuan restaurant", and the first intention is "Navigate to the nearby Sichuan restaurant".
- the response information of the first intention can be obtained according to the obtained first intention, and the content of the response information of the first intention meets the requirements of the semantic slot.
- the first intention can be sent to the cloud service, so as to obtain the reply information returned by the cloud service.
- a plurality of mapping relationships between intentions and reply information may also be stored in the memory, and the reply information corresponding to the first intention can be searched for according to the mapping relationship, so as to realize the acquisition of the reply information.
- the reply information can be one or more items, and each reply information is a text string. Moreover, if there are multiple reply messages, multiple reply messages can be used as options for the user to choose from. Still taking the above voice dialogue as an example, the reply message can be one item of "Sichuan Restaurant A” or multiple items such as “Sichuan Restaurant A", “Sichuan Restaurant B” and “Sichuan Restaurant C”. In this embodiment, the number of items of the reply message is not limited.
- the dynamic target language model can be acquired or generated based on the acquired response information of the first intention.
- the dynamic target language model includes a front-end part and a core part.
- the front-end part is used to determine the description of the confirmation information of the reply message.
- the confirmation information includes but is not limited to confirmation, correction or cancellation.
- the confirmation information can include "yes" and "right"
- the correction information can include “not right” and “wrong”
- the cancellation information can include "forget it” and "no need.”
- the core part is used to determine possible descriptions related to the reply information, for example, the user directly repeats the reply information or the user selectively repeats the reply information.
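The role of the front-end part can be illustrated with a small lookup over confirmatory phrases. The description elsewhere says the front-end part is a trained language model; the phrase table, the function name `classify_confirmation`, and the substring matching below are simplifying assumptions made only for illustration.

```python
# Hypothetical phrase table; the English phrases are illustrative renderings.
CONFIRMATORY_PHRASES = {
    "confirm": ["yes", "right"],
    "correct": ["not right", "wrong"],
    "cancel": ["forget it", "no need"],
}


def classify_confirmation(keyword: str):
    """Return 'confirm', 'correct', 'cancel', or None for a keyword."""
    text = keyword.lower()
    candidates = [
        (phrase, label)
        for label, phrases in CONFIRMATORY_PHRASES.items()
        for phrase in phrases
    ]
    # Check longer phrases first so "not right" is not mistaken for "right".
    for phrase, label in sorted(candidates, key=lambda c: -len(c[0])):
        if phrase in text:
            return label
    return None
```

Substring matching is crude (it would also fire on unrelated words that contain a phrase); a trained model, as the text describes, avoids that weakness.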
- the process of obtaining or generating a dynamic target language model based on the response information will be described in detail later, and will not be repeated here.
- the speech signal can be further received after the dynamic target language model is obtained through the acquisition or generation.
- Step 202 Acquire a voice signal, analyze the voice signal to generate keywords.
- After the vehicle-mounted module obtains the reply information to the first intention, in addition to obtaining or generating a dynamic target language model according to that reply information, it also sends the reply information to the user and then obtains the voice signal.
- The voice signal may include the voice signal of the user's dialogue with the vehicle-mounted module, that is, the voice signal responding to the reply information for the first intention, and may also include irrelevant voice signals from the user's dialogue with other users.
- For example, the voice signal of the user's dialogue with the vehicle-mounted module is "Yes, it is Sichuan Restaurant A", and the irrelevant voice signal of the user's dialogue with other users is "This time period happens to be noon, will there be a problem with parking?".
- Besides voice signals in which the user actively talks to other users, irrelevant voice signals can also include voice signals in which other users actively talk to the user, that is, other users interjecting. This embodiment does not limit the irrelevant voice signals.
- After the vehicle-mounted module obtains the voice signal, it can parse the voice signal to generate keywords.
- Optionally, before acquiring the voice signal, the method further includes: buffering the historical voice signal. In that case, parsing the voice signal to generate keywords includes: parsing the voice signal, and generating keywords after performing context detection using the historical voice signal.
- the historical voice signal is the voice signal of the past time.
- the voice command signal "Help me find a nearby Sichuan restaurant" used to obtain the first intention can be used as the historical voice signal.
- the historical voice signal can be buffered through a ring buffer.
- The ring buffer has a reference time length. If the buffered historical voice signal is longer than the reference time length, the historical voice signal that has been buffered the longest is overwritten by the new voice signal. When the historical voice signal is needed, it is simply read from the ring buffer.
- this embodiment does not limit the manner of buffering historical voice signals, and other methods may be selected to realize the buffering of historical voice according to needs.
- the vehicle-mounted module can still call the appropriate acoustic model and language model according to the application field of the speech recognition, and analyze the speech signal through the acoustic model and the language model to obtain the initial keywords.
- The initial keywords generated by parsing the voice signal of the dialogue between the user and the vehicle-mounted module are related to the first intention, whereas the initial keywords generated by parsing the irrelevant voice signal of the dialogue between the user and other users have nothing to do with the first intention. Therefore, the historical voice signal needs to be used for context detection, so that the keywords generated from the initial keywords are related only to the first intention; that is, initial keywords that are not related to the first intention are ignored.
- The manner of using the historical voice signal for context detection may include: detecting, among the initial keywords, those related to the historical voice signal, and using the keywords related to the text sequence corresponding to the historical voice signal as the generated keywords. For example, for the voice signal "This time period happens to be noon, will there be a problem with parking? Yes, it is Sichuan Restaurant A", the initial keywords obtained include "noon", "parking", "yes, it is" and "Sichuan Restaurant A".
- Among these, the keywords related to the historical voice signal "Help me find a nearby Sichuan restaurant" are "yes, it is" and "Sichuan Restaurant A". Therefore, "noon" and "parking" can be ignored, and only "yes, it is" and "Sichuan Restaurant A" are used as the generated keywords.
- This embodiment does not limit the method of context detection using historical voice signals. Whichever method is used to generate the keywords, once they are generated the dynamic target language model can be triggered to parse them and determine the second intention and the service content; see step 203 for details.
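Since the embodiment leaves the context-detection method open, the following is only one possible sketch. It assumes relatedness is approximated by token overlap with the text of the historical voice signal, and that confirmatory words always count as related; `STOPWORDS`, `CONFIRMATORY`, and `context_filter` are hypothetical names introduced here.

```python
import re

# Hypothetical helper sets used only for this illustration.
STOPWORDS = {"a", "an", "the", "me", "is", "it", "to"}
CONFIRMATORY = {"yes", "no", "right", "wrong"}


def tokens(text):
    """Lowercased alphanumeric tokens of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def context_filter(initial_keywords, history_text):
    """Keep initial keywords that overlap with the history or are confirmatory."""
    history = tokens(history_text) - STOPWORDS
    kept = []
    for keyword in initial_keywords:
        kw = tokens(keyword)
        if kw & history or kw & CONFIRMATORY:
            kept.append(keyword)
    return kept


generated = context_filter(
    ["noon", "parking", "Yes, it is", "Sichuan Restaurant A"],
    "Help me find a nearby Sichuan restaurant",
)
```

With the example dialogue, "noon" and "parking" share no token with the historical command and are dropped, while the confirmation and the restaurant name are kept.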
- Step 203 Invoke the dynamic target language model to determine the second intention and service content.
- the front-end part of the dynamic target language model parses the second intention based on keywords, and the core part of the dynamic target language model parses the service content based on keywords.
- the dynamic target language model includes a front-end part and a core part. Since the dynamic target language model is obtained by replying to the first intention, the second intention and service content determined by the dynamic target language model are all related to the first intention. Among them, the front-end part is used to determine the description of the confirmatory information of the reply message. Therefore, the confirmatory information in the keywords can be obtained by analyzing the keywords in the front-end part, and then the user's second intention can be obtained through the confirmatory information in the keywords.
- For example, if the reply message to the first intention is "Do you want to go to Sichuan Restaurant A", and the keywords obtained by parsing are "yes, it is" and "Sichuan Restaurant A", the front-end part parses "yes, it is" in the keywords, and the user's second intention is obtained as "Go to Sichuan Restaurant A".
- The core part parses the keyword "Sichuan Restaurant A" and, combined with the current car navigation scene, the service content is obtained as "Navigate to Sichuan Restaurant A".
- If the reply information to the first intention includes only one item, the user's second intention can be determined through the front-end part alone.
- If the reply information to the first intention includes two or more items, both the front-end part and the core part are needed to determine the user's second intention. For example, if the reply message to the first intention is "Which one of the following do you want to choose? The first item is Sichuan Restaurant A, and the second item is Sichuan Restaurant B", and the keywords obtained by parsing are still "yes, it is" and "Sichuan Restaurant A", the front-end part can still parse out the confirmatory information "yes, it is" in the keywords, while the core part is needed to determine which item the user chose.
- If the confirmatory information obtained by the front-end part includes confirmation information, such as "yes, it is" in the above voice dialogue, the keywords can be further parsed through the core part to obtain the service content.
- If the confirmatory information obtained by the front-end part includes correction or cancellation information, such as "no" or "not right", the user does not approve the reply message and may not be responding to it. In that case there is no need to obtain the service content through the core part; instead, other reply information is obtained, and a new dynamic target language model is obtained based on it, so that speech recognition is completed through the new dynamic target language model.
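The decision flow of step 203 described above can be sketched as follows. The functions `front_end_part`, `core_part`, and `step_203`, the phrase sets, and the returned dictionary are hypothetical stand-ins for the trained model parts, and the navigation wording assumes the car navigation scene of the example.

```python
CONFIRM = {"yes", "right"}
REJECT = {"no", "not right", "wrong", "forget it", "no need"}


def front_end_part(keywords):
    """Return 'confirm', 'reject', or None from the confirmatory keywords."""
    for keyword in keywords:
        head = keyword.lower().split(",")[0].strip()
        if head in REJECT:
            return "reject"
        if head in CONFIRM:
            return "confirm"
    return None


def core_part(keywords, reply_items):
    """Return the reply item that the user repeated, if any."""
    for keyword in keywords:
        for item in reply_items:
            if item.lower() in keyword.lower():
                return item
    return None


def step_203(keywords, reply_items):
    verdict = front_end_part(keywords)
    if verdict == "reject":
        # Correction or cancellation: fetch other reply information and
        # build a new dynamic target language model.
        return {"action": "fetch_other_reply", "service": None}
    item = core_part(keywords, reply_items)
    if item is None and verdict == "confirm" and len(reply_items) == 1:
        # With a single reply item the front-end part alone is enough.
        item = reply_items[0]
    if item is not None:
        return {"action": "execute", "service": f"Navigate to {item}"}
    return {"action": "ignore", "service": None}
```

For example, `step_203(["Yes, it is", "Sichuan Restaurant A"], ["Sichuan Restaurant A", "Sichuan Restaurant B"])` selects the first restaurant, whereas a bare "Yes" is only sufficient when a single item was offered.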
- Invoking the dynamic target language model can also yield the confidence of the second intention and the service content, the mute signal segments in the voice signal, and other information, where the confidence indicates the accuracy of the second intention and the service content.
- the provision of the service indicated by the service content can be triggered.
- the service content in the voice dialogue is "Navigate to Sichuan Restaurant A”
- the execution service content includes calling the navigation device to navigate the user from the current location (ie, the location where the voice dialogue occurs) to the location of "Sichuan Restaurant A”.
- the method provided in this embodiment further includes: confirming the second intention, and obtaining the confirmed second intention; and executing the confirmed second intention.
- the second intention and service content determined by the dynamic target language model may still be inconsistent with the first intention . Therefore, before executing the service content, confirm the second intention to ensure that the second intention is consistent with the first intention. After the confirmed second intention is obtained, the confirmed second intention is executed.
- the second intention is consistent with the first intention, including but not limited to: the second intention corresponds to the reply information of the first intention (for example, the second intention "Go to Sichuan Restaurant A” and the reply information of the first intention "Sichuan Restaurant A” correspond). Or, the second intention satisfies the condition restriction included in the first intention (for example, the second intention "Go to Sichuan Restaurant A” satisfies the distance condition restriction "near" included in the first intention).
- The method of obtaining the confirmed second intention includes: sending confirmation information of the second intention to the user, obtaining the second intention fed back by the user, and using the second intention fed back by the user as the confirmed second intention.
- the confidence level of the second intention and the service content can be obtained through the dynamic target language model. Therefore, this embodiment can send different confirmation information to the user for different confidence levels to realize the confirmation of the second intention.
- Take the second intention "Go to Sichuan Restaurant A" as an example.
- If the confidence level is higher than the threshold, the second intention is relatively credible, so indirect confirmation can be used. For example, the voice "You have selected Sichuan Restaurant A", which assumes by default that the second intention is correct, is sent to the user as the confirmation information of the second intention to obtain the second intention fed back by the user.
- If the confidence level is not higher than the threshold, the credibility of the second intention is low, so direct confirmation is used. For example, the voice "Are you sure you want to choose Sichuan Restaurant A?" is sent to the user.
- The confirmation information sent in both the indirect and the direct confirmation method is voice confirmation information. If the second intention fed back by the user cannot be obtained through voice confirmation information, other forms of confirmation information, such as text confirmation information, can be used to confirm the second intention with the user.
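The choice between indirect and direct confirmation can be sketched as a single comparison against the threshold. The function name `confirmation_prompt` and the default threshold of 0.8 are assumptions for illustration; the description does not give a threshold value.

```python
def confirmation_prompt(item: str, confidence: float, threshold: float = 0.8) -> str:
    """Pick the confirmation wording from the confidence of the second intention."""
    if confidence > threshold:
        # Indirect confirmation: assume the second intention is correct.
        return f"You have selected {item}."
    # Direct confirmation: ask the user explicitly.
    return f"Are you sure you want to choose {item}?"
```

A confidence exactly equal to the threshold falls into the direct branch, matching the "not higher than the threshold" wording above.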
- For example, the terminal displays the reply information for the first intention to the user, so that the user can select any item of reply information through the terminal; the intention indicated by the selected reply information is taken as the confirmed second intention, and the confirmed second intention is executed to complete speech recognition.
- The dynamic target language model further includes a tail part, which is used to confirm whether there is an additional intention. Therefore, the method provided in this embodiment further includes: invoking the dynamic target language model to determine the additional intention, where the tail part of the dynamic target language model parses the additional intention according to the keywords, thereby identifying each intention in the above multi-intention dialogue.
- the additional intent is obtained by parsing keywords through the tail part.
- the schematic diagram of the front part, the core part and the tail part can be seen in Fig. 4.
- out of vocabulary (OOV) represents vocabulary outside the dictionary, and the dictionary is used to obtain words based on phoneme sequences.
- eps stands for skipping edge and is used to indicate optional parts.
- The tail part includes a tail marker word.
- Tail marker words include but are not limited to words such as "zai" ("again"), "also" and "by the way".
- For example, the tail marker word in the above multi-intention dialogue is "zai". Since users' wording of tail marker words is usually relatively fixed, a set of multiple tail marker words can be used as the corpus to train a language model, and the trained language model is used as the tail part. Accordingly, invoking the dynamic target language model to determine the additional intention, where the tail part parses the additional intention according to the keywords, includes: the tail part parses out, from the keywords, a reference tail marker word and the time point at which it occurs; based on the reference tail marker word and in combination with the first intention and the second intention, the dynamic target language model is updated to obtain an updated target language model; and the updated target language model is invoked to parse the additional intention according to the keywords and the time point of the reference tail marker word.
- The reference tail marker word is one of the set of tail marker words used as the corpus. If no reference tail marker word exists, there is no additional intention, and the service indicated by the service content can be provided directly; if a reference tail marker word exists, there is an additional intention, and the time point at which the reference tail marker word occurs is also obtained.
- the language model is also called according to the first intention and the second intention.
- the language model can be the language model of the domain of the first intention and the second intention.
- For example, if the domain of the intentions is "navigation", the language model of the navigation domain can be obtained to replace the dynamic target language model, yielding the updated target language model.
- the updated target language model is called to parse the keywords after the time point where the reference tag word is located, so as to obtain the additional intention of the user.
- For example, the reference tail marker word is "zai".
- The voice signal before the time point of "zai" is "This time period happens to be noon, will there be a problem with parking? Yes, it is Sichuan Restaurant A".
- The keywords included in this voice signal have already been parsed by the front-end part and the core part of the dynamic target language model. Therefore, the updated target language model can be invoked to parse the keywords in the voice signal after the time point of "zai", that is, the keywords in "find a parking space for me", so as to obtain the user's additional intention.
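Splitting the keywords at the time point of the reference tail marker word can be sketched as follows. The timestamped keyword tuples, the marker set `TAIL_MARKERS` (English renderings chosen for illustration), and the function `split_at_tail_marker` are assumptions, not part of the described system.

```python
# Hypothetical marker set; the described tail part is a trained model instead.
TAIL_MARKERS = {"again", "also", "by the way"}


def split_at_tail_marker(timed_keywords):
    """Split (time_seconds, keyword) pairs around the first tail marker word.

    Returns (marker, time_point, keywords_before, keywords_after). When no
    marker is present there is no additional intention and `after` is empty.
    """
    ordered = sorted(timed_keywords)
    for time_point, keyword in ordered:
        if keyword.lower() in TAIL_MARKERS:
            before = [k for t, k in ordered if t < time_point]
            after = [k for t, k in ordered if t > time_point]
            return keyword, time_point, before, after
    return None, None, [k for _, k in ordered], []


marker, when, before, after = split_at_tail_marker([
    (0.0, "Yes, it is"),
    (0.8, "Sichuan Restaurant A"),
    (1.9, "also"),
    (2.3, "find"),
    (2.6, "parking space"),
])
```

Only the `after` keywords would be handed to the updated target language model, because the earlier ones were already parsed by the front-end and core parts.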
- This embodiment also provides another way to obtain the updated target language model: after the language model is obtained according to the first intention and the second intention, the combination of that language model and the tail part is used as the updated target language model. Thus, referring to FIG. 5, after parsing one additional intention the updated target language model can iteratively detect whether further additional intentions exist, which increases the number of intentions that can be recognized.
- If an additional intention exists, after it is obtained through the updated target language model, the second intention is executed as follows.
- The method includes: if the additional intention exists, executing the service content and the additional intention. That is, after the service content is obtained it is not executed immediately; the tail part is first used to confirm whether the voice signal contains an additional intention. If so, the additional intention is obtained, and finally both the service content and the additional intention are executed. If the tail part confirms that there is no additional intention in the voice signal, the obtained service content is executed.
- executing the service content and the additional intention includes: executing the service content and the additional intention in combination, or executing the service content and the additional intention sequentially. For example, if the service content is "Navigate to Sichuan Restaurant A" and the additional intention is "Play a song”, the additional intention can be executed during the execution of the service content, that is, the service content and the additional intention are combined. If the service content is "Navigate to Sichuan Restaurant A” and the additional intention is "Find a parking space", the service content and the additional intention need to be executed in sequence. In addition, different service contents and additional intentions can be executed by different execution entities.
- the vehicle-mounted terminal may be other terminals on the vehicle other than the vehicle-mounted module, such as a vehicle-mounted display screen, a vehicle-mounted air conditioner, and a vehicle-mounted speaker.
- it can also be executed jointly by more than two execution entities of a third-party cloud service, an in-vehicle module, an in-vehicle terminal, and a car company, which is not limited in the embodiment of the present application.
- Obtaining or generating a dynamic target language model according to the reply information to the first intention includes: converting the reply information to the first intention into a reference format to obtain reply information in the reference format, and obtaining or generating the dynamic target language model according to the reply information in the reference format.
- the dynamic target language model includes at least a front-end part and a core part, and can also include a tail part.
- The front-end part is used to determine the description of the confirmatory information of the reply message. As with the tail part, since users' wording of confirmatory information is relatively fixed, a collection of confirmatory information for confirmation, correction or cancellation can be used as a corpus to train a language model, and the trained language model is used as the front-end part, so that the front-end part is able to parse keywords to obtain confirmation, correction, or cancellation information.
- The core part needs to be obtained according to the reply information in the reference format described above.
- the reply information may be provided by multiple suppliers. Since different suppliers may provide reply information in different formats, the reply information needs to be converted into a reference format, so that the reply information format is unified to facilitate the reception of the reply information.
- the reply message can be converted into different reference formats, so that the reply message format in the same application field is the same.
- the reply message is often an address, so the address can be unified into the format of country (or region), province (or state), city, district, road, and house number.
- POI point of interest
- the reply information is often related to the point of interest. Therefore, the reply information can be unified into the format of category name, address, contact number, and user evaluation.
- the category name can be hotel, restaurant, shopping mall, museum, concert hall, cinema, stadium, hospital, pharmacy, and so on.
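Unifying reply information from different suppliers into one reference format can be sketched with a small record type and one converter per supplier. The two supplier payload layouts, the field names, and the sample values below are invented for illustration; the `name` field is an added assumption beyond the category name, address, contact number, and user evaluation listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PoiReply:
    """Hypothetical reference format for point-of-interest reply information."""
    name: str
    category: str
    address: str
    contact_number: str
    user_rating: float


def from_supplier_a(raw: dict) -> PoiReply:
    # Hypothetical supplier A: flat keys, numeric rating.
    return PoiReply(raw["name"], raw["category"], raw["address"],
                    raw["phone"], float(raw["rating"]))


def from_supplier_b(raw: dict) -> PoiReply:
    # Hypothetical supplier B: nested location, rating given as "4.5/5".
    loc = raw["location"]
    address = ", ".join([loc["city"], loc["district"], loc["road"], loc["number"]])
    score = float(raw["score"].split("/")[0])
    return PoiReply(raw["title"], raw["type"], address, raw["tel"], score)


a = from_supplier_a({"name": "Sichuan Restaurant A", "category": "restaurant",
                     "address": "Example City, Example Road, 1",
                     "phone": "000-0000", "rating": 4})
b = from_supplier_b({"title": "Sichuan Restaurant B", "type": "restaurant",
                     "location": {"city": "Example City", "district": "Example District",
                                  "road": "Example Road", "number": "12"},
                     "tel": "000-0001", "score": "4.5/5"})
```

Once both suppliers produce the same record type, downstream steps such as word segmentation and model training can treat all reply information uniformly.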
- the reply information can be marked with word segmentation to facilitate the implementation of the conversion of the reference format.
- Word segmentation labeling refers to decomposing a text string into vocabulary. If the decomposed vocabulary includes proper nouns, proper nouns can also be labeled. Both word segmentation and labeling can be implemented by artificial intelligence algorithms.
- artificial intelligence algorithms include but are not limited to conditional random field (CRF), long short term memory (LSTM) network, and hidden Markov model (hidden Markov model, HMM).
- the dynamic target language model is further obtained or generated according to the reply information in the reference format.
- There are three ways to obtain the target language model according to the reply information in the reference format:
- The first acquisition method: convert the trained language model into a weighted finite state transition automaton model, and use the weighted finite state transition automaton model as the dynamic target language model.
- The trained language model is obtained by training on the reply information in the reference format and the reference vocabulary.
- the reference vocabulary includes, but is not limited to, the category name corresponding to the vocabulary in the reply message in the reference format, and referential expression words.
- the vocabulary in the response message in the reference format can be obtained through word segmentation and other methods, so as to further obtain the category name corresponding to the vocabulary.
- the category name of "Sichuan Restaurant A” is "Restaurant”.
- Referential expression words are used to refer to any item of reply information in the reference format. For example, when there are multiple items of reply information in the reference format, the referential expression words include "the first item", "the middle one", "the second from last", "the last item", and so on.
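How a referential expression word selects one item from several reply items can be sketched as an index lookup. The English phrase strings and the function `resolve_reference` are illustrative assumptions; in the described method these words are part of the training corpus rather than a lookup table.

```python
def resolve_reference(expression: str, items: list):
    """Map a referential expression to one of the offered reply items."""
    n = len(items)
    index = {
        "the first item": 0,
        "the middle one": n // 2,
        "the second from last": n - 2,
        "the last item": n - 1,
    }.get(expression.lower())
    # Unknown expressions and out-of-range positions resolve to nothing.
    if index is None or not 0 <= index < n:
        return None
    return items[index]
```

For three items A, B and C, "the middle one" and "the second from last" both resolve to B, and "the last item" resolves to C.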
- the trained language model includes the initial language model trained with reference format reply information and reference vocabulary as corpus.
- the initial language model may adopt the N-gram model, and the schematic diagram of the N-gram model can be seen in FIG. 6.
- The N-gram model assumes that the occurrence probability of a word depends only on the N-1 words before it and is independent of all other words. For example, when N is 3, the N-gram model is a third-order model, and the occurrence probability of a word depends on the two words before it; that is, the probability that the i-th word $X_i$ appears is $P(X_i \mid X_{i-2}, X_{i-1})$.
- the N-gram model can count the probability of a word appearing after another word, that is, the probability of two words appearing next to each other.
- the N-gram model is trained through the corpus, and the trained N-gram model is obtained.
- the trained N-gram model has counted the probability of adjacent occurrence of each word contained in the corpus.
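The counting of adjacent-word probabilities can be sketched with a second-order (bigram) model, chosen here only for brevity. The corpus sentences, the `<s>` and `</s>` boundary markers, and the function `train_bigram` are assumptions for this illustration, and no smoothing is applied.

```python
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Estimate P(current | previous) by relative frequency over the corpus."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = ["<s>"] + sentence.split() + ["</s>"]
        for previous, current in zip(words, words[1:]):
            counts[previous][current] += 1
    return {
        previous: {w: c / sum(following.values()) for w, c in following.items()}
        for previous, following in counts.items()
    }


probs = train_bigram([
    "Sichuan Restaurant A",   # reply information
    "Sichuan Restaurant B",   # reply information
    "the first item",         # referential expression word
])
```

Here "Restaurant" always follows "Sichuan", while "A" and "B" each follow "Restaurant" half of the time.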
- The trained language model can be converted into a weighted finite state transducer (WFST) model.
- WFST can convert the input phoneme sequence into words based on the dictionary, and then obtain the weight of each word adjacent to each other based on the probability of each word appearing adjacently calculated by the trained language model, and output core information according to the weight.
- the core information can be regarded as a word sequence, so the appearance probability of the core information is the product of the weights of all words contained in the word sequence.
- the analysis range of the trained language model can be expanded through conversion.
- The trained language model can obtain the vocabulary in the reply message and the reference vocabulary by parsing keywords, whereas the converted WFST can parse not only the vocabulary in the reply message and the reference vocabulary, but also combinations of two or three of the following: the vocabulary in the reply message, the category name corresponding to the vocabulary, and the referential expression words. For example, the WFST can parse "the restaurant in the middle", which combines a referential expression word with the category name corresponding to the vocabulary.
- the WFST is the core part of the dynamic target language model.
- the WFST and the front-end part can be used as a dynamic target language model.
- The second acquisition method: convert the trained language model into a weighted finite state transition automaton model, and use the weighted finite state transition automaton model as the first language model, where the trained language model is obtained by training on reply information in the reference format whose length is not less than the reference length; obtain the second language model according to reply information in the reference format whose length is less than the reference length; obtain the third language model according to the reference vocabulary; and merge the first language model, the second language model and the third language model to obtain an overall language model, which is used as the dynamic target language model.
- For the reference vocabulary, refer to the description in the first acquisition method; it is not repeated here.
- In the second acquisition method, reply messages with a length less than the reference length and the reference vocabulary are not used as corpus; only reply messages with a length not less than the reference length are used as the corpus.
- That is, the trained language model is the initial language model trained with reply messages whose length is not less than the reference length as the corpus, and the initial language model can still be an N-gram model.
- In one example, the reference length is 2, that is, two words.
- The reason is that the N-gram model uses a back-off algorithm.
- The back-off algorithm works as follows: for a word sequence that has not appeared in the corpus, the occurrence probability of a lower-order word sequence can be used as the occurrence probability of that word sequence, which ensures that the N-gram model can output a result for any input phoneme sequence. For example, when the word sequence (X_{i-2}, X_{i-1}, X_i) does not appear in the corpus of a third-order model, the model has no statistics for the occurrence probability of the i-th word, P(X_i
- The trained language model is used to determine possible descriptions related to the reply message, and users usually utter different voice signals to repeat reply messages of different lengths in order to confirm or select a reply message.
- For reply messages whose length is less than the reference length, users often repeat the entire reply message instead of repeating only some of its words.
- If reply messages with a length less than the reference length were used as corpus to train an N-gram model that includes the back-off algorithm, the trained language model would count some word sequences with a low occurrence probability, which would degrade the parsing effect of the trained language model.
- The reference length can be set based on the scenario or experience, and can also be adjusted during the speech recognition process, which is not limited in the embodiment of the present application.
- For example, "Oriental Pearl" can be a reply message with a length of 1. If "Oriental Pearl" is used as corpus, the trained language model will provide word sequences such as "Dongming" and "Fangzhu", which obviously have a low probability of occurrence. Therefore, in this embodiment, a second language model that does not use the back-off algorithm is obtained from the reply information whose length is less than the reference length, and the second language model only parses, among the keywords, entire reply messages whose length is less than the reference length.
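- The back-off behaviour discussed above can be sketched as follows. This is a simplified illustration under stated assumptions: the function name `backoff_probability` and all probability values are hypothetical, and the lower-order probability is reused unchanged, as the text describes, whereas practical back-off schemes such as Katz back-off also apply discounting and back-off weights.

```python
def backoff_probability(word, history, ngram_prob):
    """Return the probability of `word` after `history`.

    If the full word sequence was never seen, the oldest history word is
    dropped and the lookup is retried at the next lower order.
    """
    context = tuple(history)
    while True:
        key = context + (word,)
        if key in ngram_prob:
            return ngram_prob[key]
        if not context:
            return 0.0  # the word was not seen at any order
        context = context[1:]


# Hypothetical probabilities keyed by word sequence.
ngram_prob = {
    ("B", "City"): 0.6,   # second-order entry
    ("City",): 0.1,       # first-order entries
    ("Avenue",): 0.05,
}

# The third-order sequence ("A", "B", "City") is unseen, so the second-order entry is used.
seen = backoff_probability("City", ("A", "B"), ngram_prob)
# Neither the third-order nor the second-order sequence is seen, so the first-order entry is used.
unseen = backoff_probability("Avenue", ("A", "B"), ngram_prob)
```

- Because any sub-sequence can receive a probability this way, a model with back-off also scores fragments that users are unlikely to say, which is the concern raised for short reply messages.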
- For the reference vocabulary, the user's mode of expression is relatively fixed, and the number of combinations of category names corresponding to the vocabulary and referential expression words is relatively limited. Therefore, the category names corresponding to the vocabulary, the referential expression words, and the combinations of category names and referential expression words can be used as corpus to train a third language model that does not use the back-off algorithm.
- The first language model can parse, among the keywords, an entire reply message or a combination of words contained in the entire reply message. For example, in a car navigation scenario with a reference length of 2, "No. 1, D Avenue, C District, B City, A Province" is a reply message whose length is greater than the reference length. The user may repeat word sequences such as "B City" or "D Avenue No. 1", so the keywords in the voice signal repeated by the user can be parsed by the first language model, which uses the back-off algorithm.
- The first language model, the second language model and the third language model are merged to obtain the overall language model, which is the core part of the dynamic target language model.
- The overall language model and the front-end part together form the dynamic target language model.
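- How the corpus is divided among the three language models in the second acquisition method can be sketched as follows. This is only an illustration of the routing rule: the function name `partition_corpora`, the representation of a reply message as a list of words, and the sample reference vocabulary are assumptions, and the training and merging of the models are not shown.

```python
def partition_corpora(replies, reference_vocabulary, reference_length=2):
    """Split training material among the first, second and third language models."""
    # Replies at least as long as the reference length train the model with back-off.
    long_replies = [r for r in replies if len(r) >= reference_length]
    # Shorter replies go to a model without back-off, so they are matched only as a whole.
    short_replies = [r for r in replies if len(r) < reference_length]
    return {
        "first": long_replies,
        "second": short_replies,
        "third": list(reference_vocabulary),
    }


replies = [
    ["A Province", "B City", "C District", "D Avenue", "No. 1"],
    ["Oriental Pearl"],
]
# Hypothetical reference vocabulary entries.
reference_vocabulary = ["the middle one", "restaurant", "the restaurant in the middle"]

corpora = partition_corpora(replies, reference_vocabulary)
```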
- The third acquisition method: obtain a word confusion network based on the reply information in the reference format whose length is not less than the reference length, where each word in the word confusion network has a transition probability; calculate the penalty weight of each word, convert the word confusion network into a weighted finite state transition automaton model according to the penalty weight of each word, and use the weighted finite state transition automaton model as the first language model; obtain the second language model from the reply information in the reference format whose length is less than the reference length; obtain the third language model from the reference vocabulary; merge the first language model, the second language model and the third language model to obtain the overall language model, and use the overall language model as the dynamic target language model.
- For the reference vocabulary, refer to the description in the first acquisition method. For obtaining the second language model from the reply information whose length is less than the reference length and obtaining the third language model from the reference vocabulary, refer to the description in the second acquisition method. These details are not repeated here. Next, the process of obtaining the first language model is explained:
- The method of obtaining the word confusion network includes: aligning words of the same category across the reply messages whose length is not less than the reference length, and using the number of categories plus one as the number of states in the word confusion network. The states are then connected by arcs. Each arc carries a word and the transition probability of that word, where the transition probability indicates the frequency with which the word appears in its category, and the transition probabilities on all arcs between two adjacent states sum to 1.
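- The construction described above can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `build_confusion_network` and the sample replies are hypothetical, every reply is assumed to be already aligned so that position k holds the word of category k, and skip edges for missing categories are not modelled.

```python
from collections import Counter


def build_confusion_network(aligned_replies):
    """Return (number of states, arcs) for a word confusion network.

    Each arc is (from_state, to_state, word, transition_probability).
    """
    num_categories = len(aligned_replies[0])
    arcs = []
    for slot in range(num_categories):
        # Relative frequency of each word within its category.
        counts = Counter(reply[slot] for reply in aligned_replies)
        total = sum(counts.values())
        for word, count in counts.items():
            arcs.append((slot, slot + 1, word, count / total))
    # One more state than there are categories.
    return num_categories + 1, arcs


aligned_replies = [
    ["B City", "D Avenue"],
    ["B City", "E Road"],
    ["F City", "D Avenue"],
    ["B City", "D Avenue"],
]
num_states, arcs = build_confusion_network(aligned_replies)
```

- With two categories the network has three states, and the arcs between each pair of adjacent states carry probabilities that sum to 1.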
- The penalty weight of each word is calculated, and the word confusion network is converted into a WFST according to the penalty weights to obtain the first language model.
- The first language model calculates the penalty weights of the multiple word sequences to which the phoneme sequence of the speech signal may correspond. The penalty weight of a word sequence is equal to the product of the penalty weights of the words in that sequence, and the word sequence with the smallest penalty weight is output.
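- The selection rule above can be sketched as follows. The function name `best_sequence`, the candidate sequences and the penalty values are hypothetical. The sketch follows the text in multiplying word penalties; implementations that work in the negative-logarithm domain often add the weights instead.

```python
from math import prod


def best_sequence(candidates, word_penalty):
    """Return the candidate word sequence with the smallest penalty weight."""

    def sequence_penalty(words):
        # Penalty of a sequence: product of the penalties of its words.
        return prod(word_penalty[w] for w in words)

    return min(candidates, key=sequence_penalty)


# Hypothetical per-word penalty weights.
word_penalty = {"B City": 0.3, "F City": 1.4, "D Avenue": 0.3, "E Road": 1.4}
candidates = [
    ("B City", "D Avenue"),
    ("F City", "E Road"),
    ("B City", "E Road"),
]
best = best_sequence(candidates, word_penalty)
```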
- Methods for calculating the penalty weight of each word include, but are not limited to, the following three:
- The first calculation method: for any word, the negative logarithm of the transition probability of the word is used as the penalty weight.
- The transition probability of a word indicates the frequency with which the word appears in its category. The higher that frequency, the larger the transition probability and the smaller its negative logarithm; that is, the penalty weight decreases as the frequency of appearance increases, so that the target language model can better parse words that appear frequently in their category.
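- The first calculation method can be sketched as follows. The function name `neg_log_penalties` and the transition probabilities are hypothetical, and the natural logarithm is an assumption, since the text does not state the base.

```python
import math


def neg_log_penalties(transition_prob):
    """Penalty weight of each word: negative logarithm of its transition probability."""
    return {word: -math.log(p) for word, p in transition_prob.items()}


# Hypothetical transition probabilities within one category.
penalties = neg_log_penalties({"B City": 0.75, "F City": 0.25})
```

- The more frequent word "B City" receives the smaller penalty weight.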
- The second calculation method: for any word, the logarithm of the number of reply messages in the reference format that contain the word is used as the penalty weight.
- The discriminative strength of a word is defined according to the following formula:
- The inverse presence frequency (IPF) indicates the discriminative strength of the word.
- T_Fi denotes a word in category F_i.
- N is the total number of reply messages in the reference format.
- n is the number of reply messages in the reference format that contain the word T_Fi. It can be seen that the more reply messages in the reference format contain the word, the smaller the IPF value and the weaker the discriminative strength of the word.
- The IPF of a skip edge, IPF(skip), can be expressed as:
- IPF(skip) can also be rewritten so that the IPF of a skip edge is not always equal to zero.
- The rewritten IPF(skip) is expressed according to the following formula:
- The penalty weight Penalty(T_Fi) of a word can be defined according to the following formula, so that the penalty weight obtained for the word is the logarithm of the number of reply messages in the reference format that contain the word:
- Penalty(skip) can be defined as:
- In this way, words with strong discriminative strength, that is, words contained in only a small number of reply messages in the reference format, are given a small penalty weight, so that the target language model can better parse these distinguishing words.
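- The second calculation method can be sketched as follows. The penalty weight follows the text directly (the logarithm of the number of reply messages containing the word). The formula for IPF is not reproduced in this text, so the form log(N/n) used here is an assumption made by analogy with inverse document frequency; it is consistent with the statement that IPF shrinks as more reply messages contain the word. The function name and sample replies are hypothetical, the natural logarithm is assumed, and skip edges are not modelled.

```python
import math


def reply_count_penalties(replies):
    """Return (penalty, ipf) dictionaries keyed by word."""
    containing = {}
    for reply in replies:
        for word in set(reply):  # count each reply at most once per word
            containing[word] = containing.get(word, 0) + 1
    total = len(replies)
    # Penalty weight: logarithm of the number of replies that contain the word.
    penalty = {w: math.log(n) for w, n in containing.items()}
    # Assumed IPF form: log(N / n).
    ipf = {w: math.log(total / n) for w, n in containing.items()}
    return penalty, ipf


replies = [
    ["B City", "D Avenue"],
    ["B City", "E Road"],
    ["B City", "G Street"],
]
penalty, ipf = reply_count_penalties(replies)
```

- "B City" appears in every reply, so it has the largest penalty weight and, under the assumed form, an IPF of zero, while "D Avenue" appears in one reply and has a penalty weight of zero.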
- The third calculation method: for any word, the logarithm of the number of occurrences of the word across the reply messages in the reference format is used as the penalty weight.
- The third calculation method can still use the following formula to define the discriminative strength of a word:
- N represents the total number of words contained in the reply messages in the reference format.
- n represents the number of occurrences of the word T_Fi in the reply messages in the reference format.
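- The third calculation method differs from the second only in what is counted, which can be sketched as follows. The function name `occurrence_penalties` and the sample replies are hypothetical, and the natural logarithm is assumed.

```python
import math
from collections import Counter


def occurrence_penalties(replies):
    """Penalty weight of each word: logarithm of its total number of occurrences."""
    counts = Counter(word for reply in replies for word in reply)
    return {word: math.log(n) for word, n in counts.items()}


replies = [
    ["B City", "D Avenue"],
    ["B City", "E Road"],
]
penalties_by_occurrence = occurrence_penalties(replies)
```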
- The first language model, the second language model and the third language model can be merged to obtain the overall language model, which is the core part of the dynamic target language model.
- The overall language model and the front-end part (or the overall language model, the front-end part and the tail part) can be used as the dynamic target language model.
- The embodiment of the present application obtains or generates a dynamic target language model including a front-end part and a core part according to the reply information to the first intention, parses the voice signal to obtain keywords, and then calls the dynamic target language model to parse the keywords to obtain the second intention and the service content. Since the dynamic target language model is obtained from the reply information to the first intention, the second intention and the service content obtained through the dynamic target language model are both related to the first intention. Therefore, the embodiment of the present application ignores speech that is unrelated to the first intention, prevents the provided service content from deviating from the user's needs, achieves a good recognition effect, and improves the user experience.
- The embodiment of the present application also determines, through the tail part of the dynamic target language model, whether the voice signal contains multiple intentions, so as to provide the service indicated by each of the user's intentions, further improving the user experience.
- an embodiment of the present application also provides a voice recognition device, which includes:
- a first acquisition module 901, configured to acquire or generate a dynamic target language model according to the reply information to the first intention, where the dynamic target language model includes a front-end part and a core part, the core part is used to determine possible descriptions related to the reply information, and the front-end part is used to determine descriptions of confirmatory information for the reply information;
- a second acquisition module 902, configured to acquire a voice signal and parse the voice signal to generate keywords;
- a first determination module 903, configured to call the dynamic target language model to determine the second intention and the service content, where the front-end part of the dynamic target language model parses out the second intention based on the keywords, and the core part of the dynamic target language model parses out the service content based on the keywords.
- The dynamic target language model further includes a tail part, which is used to confirm whether there is an additional intention, and the device further includes:
- a second determination module, configured to call the dynamic target language model to determine the additional intention, where the tail part of the dynamic target language model parses out the additional intention based on the keywords.
- The tail part includes tail marker words;
- the second determination module is configured so that the tail part parses out, based on the keywords, a reference tail marker word and the time point at which the reference tail marker word is located; the dynamic target language model is updated based on the reference tail marker word, in combination with the first intention and the second intention, to obtain an updated target language model; and the updated target language model is called to parse out the additional intention based on the keywords and the time point at which the reference tail marker word is located.
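- The first step performed by the tail part, locating a reference tail marker word and its time point, can be sketched as follows. This is a rough illustration only: the function names, the marker words, the keywords and their timestamps are all hypothetical, updating the dynamic target language model is not shown, and treating the keywords spoken after the marker as the material for the additional intention is an assumption that the text does not state.

```python
def find_tail_marker(timed_keywords, tail_markers):
    """Return (marker_word, time_point) for the first tail marker word, or None."""
    for word, time_point in timed_keywords:
        if word in tail_markers:
            return word, time_point
    return None


def keywords_after(timed_keywords, time_point):
    """Keywords spoken after the given time point (assumed to carry the additional intention)."""
    return [word for word, t in timed_keywords if t > time_point]


# Hypothetical keywords with the time, in seconds, at which each was spoken.
timed_keywords = [
    ("yes", 0.2),
    ("the middle one", 0.6),
    ("also", 1.4),
    ("play", 1.7),
    ("music", 2.0),
]
tail_markers = {"also", "in addition"}

marker = find_tail_marker(timed_keywords, tail_markers)
additional = keywords_after(timed_keywords, marker[1]) if marker else []
```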
- the device further includes:
- a cache module, configured to cache historical voice signals;
- the second acquisition module 902 is configured to parse the voice signal and use the historical voice signals to perform context detection to generate the keywords.
- the device further includes: a confirmation module, configured to confirm the second intention, and obtain the confirmed second intention.
- the confirmation module is configured to send confirmation information of the second intention to the user, obtain the second intention fed back by the user, and use the second intention fed back by the user as the confirmed second intention.
- the first obtaining module 901 is configured to convert the reply information of the first intention into a reference format to obtain reply information in the reference format, and obtain or generate a dynamic target language model according to the reply information in the reference format.
- The first acquisition module 901 is configured to convert the trained language model into a weighted finite state transition automaton model, and use the weighted finite state transition automaton model as the dynamic target language model, where the trained language model is obtained by training on the reply information in the reference format and the reference vocabulary.
- The first acquisition module 901 is configured to convert the trained language model into a weighted finite state transition automaton model, and use the weighted finite state transition automaton model as the first language model, where the trained language model is obtained by training on reply information in the reference format whose length is not less than the reference length; obtain the second language model from reply information in the reference format whose length is less than the reference length, and obtain the third language model from the reference vocabulary; and merge the first language model, the second language model and the third language model to obtain the overall language model, and use the overall language model as the dynamic target language model.
- the first obtaining module 901 includes:
- the first obtaining unit is configured to obtain a word confusion network based on the reply information in a reference format with a length not less than a reference length, and each word in the word confusion network has a transition probability;
- the calculation unit is used to calculate the penalty weight of each word, convert the word confusion network into a weighted finite state transition automaton model according to the penalty weight of each word, and use the weighted finite state transition automaton model as the first language model;
- the second obtaining unit is configured to obtain the second language model according to the reply information in the reference format whose length is less than the reference length, and obtain the third language model according to the reference vocabulary;
- the merging unit is used to merge the first language model, the second language model, and the third language model to obtain the overall language model, and the overall language model is used as the dynamic target language model.
- the calculation unit is configured to use the negative logarithm of the transition probability of the word as the penalty weight for any word.
- the calculation unit is configured to use, for any vocabulary, the logarithm of the number of reply messages in the reference format containing the vocabulary as the penalty weight.
- the calculation unit is configured to use the logarithm of the number of occurrences of the word in the reply message in the reference format as the penalty weight for any word.
- The embodiment of the present application obtains or generates a dynamic target language model including a front-end part and a core part according to the reply information to the first intention, parses the voice signal to obtain keywords, and then calls the dynamic target language model to parse the keywords to obtain the second intention and the service content. Since the dynamic target language model is obtained from the reply information to the first intention, the second intention and the service content obtained through the dynamic target language model are both related to the first intention. Therefore, the embodiment of the present application ignores speech that is unrelated to the first intention, prevents the provided service content from deviating from the user's needs, achieves a good recognition effect, and improves the user experience.
- The embodiment of the present application also determines, through the tail part of the dynamic target language model, whether the voice signal contains multiple intentions, so as to provide the service indicated by each of the user's intentions, further improving the user experience.
- An embodiment of the present application also provides a voice recognition device, which includes a memory and a processor; the memory stores at least one instruction, and the at least one instruction is loaded and executed by the processor to implement the voice recognition method provided by the embodiment of the present application.
- The method includes: obtaining or generating a dynamic target language model according to the reply information to the first intention, where the dynamic target language model includes a front-end part and a core part, the core part is used to determine possible descriptions related to the reply information, and the front-end part is used to determine descriptions of confirmatory information for the reply information; obtaining a voice signal and parsing the voice signal to generate keywords; and calling the dynamic target language model to determine the second intention and the service content, where the front-end part of the dynamic target language model parses out the second intention based on the keywords, and the core part of the dynamic target language model parses out the service content based on the keywords.
- The dynamic target language model further includes a tail part, which is used to confirm whether there is an additional intention, and the method further includes: calling the dynamic target language model to determine the additional intention, where the tail part of the dynamic target language model parses out the additional intention based on the keywords.
- The tail part includes tail marker words; calling the dynamic target language model to determine the additional intention, where the tail part of the dynamic target language model parses out the additional intention based on the keywords, includes: the tail part parsing out, based on the keywords, a reference tail marker word and the time point at which the reference tail marker word is located; updating the dynamic target language model based on the reference tail marker word, in combination with the first intention and the second intention, to obtain an updated target language model; and calling the updated target language model to parse out the additional intention based on the keywords and the time point at which the reference tail marker word is located.
- Before acquiring the voice signal, the method further includes: caching historical voice signals; parsing the voice signal to generate keywords includes: parsing the voice signal and using the historical voice signals for context detection to generate the keywords.
- the method further includes: confirming the second intention, and obtaining the confirmed second intention.
- Confirming the second intention and obtaining the confirmed second intention includes: sending confirmation information of the second intention to the user, obtaining the second intention fed back by the user, and using the second intention fed back by the user as the confirmed second intention.
- Obtaining or generating a dynamic target language model according to the reply information to the first intention includes: converting the reply information to the first intention into a reference format to obtain reply information in the reference format, and obtaining or generating the dynamic target language model according to the reply information in the reference format.
- Obtaining or generating the dynamic target language model according to the reply information in the reference format includes: converting the trained language model into a weighted finite state transition automaton model, and using the weighted finite state transition automaton model as the dynamic target language model, where the trained language model is obtained by training on the reply information in the reference format and the reference vocabulary.
- Obtaining or generating the dynamic target language model according to the reply information in the reference format includes: converting the trained language model into a weighted finite state transition automaton model, and using the weighted finite state transition automaton model as the first language model, where the trained language model is obtained by training on reply information in the reference format whose length is not less than the reference length; obtaining the second language model from reply information in the reference format whose length is less than the reference length, and obtaining the third language model from the reference vocabulary; and merging the first language model, the second language model and the third language model to obtain the overall language model, and using the overall language model as the dynamic target language model.
- Obtaining or generating the dynamic target language model according to the reply information in the reference format includes: obtaining a word confusion network based on reply information in the reference format whose length is not less than the reference length, where each word in the word confusion network has a transition probability; calculating the penalty weight of each word, converting the word confusion network into a weighted finite state transition automaton model according to the penalty weight of each word, and using the weighted finite state transition automaton model as the first language model; obtaining the second language model from reply information in the reference format whose length is less than the reference length, and obtaining the third language model from the reference vocabulary; and merging the first language model, the second language model and the third language model to obtain the overall language model, and using the overall language model as the dynamic target language model.
- calculating the penalty weight of each vocabulary includes: for any vocabulary, taking the negative logarithm of the transition probability of the vocabulary as the penalty weight.
- calculating the penalty weight for each vocabulary includes: for any vocabulary, the logarithm of the number of items of the reply information in the reference format containing the vocabulary is used as the penalty weight.
- calculating the penalty weight of each vocabulary includes: for any vocabulary, the logarithm of the number of occurrences of the vocabulary in the reply message in the reference format is used as the penalty weight.
- the embodiment of the present application also provides a computer-readable storage medium in which at least one instruction is stored.
- The instruction is loaded and executed by a processor to implement the voice recognition method provided by the embodiment of the present application. The method includes: acquiring or generating a dynamic target language model according to the reply information to the first intention.
- The dynamic target language model includes a front-end part and a core part. The core part is used to determine possible descriptions related to the reply information, and the front-end part is used to determine descriptions of confirmatory information for the reply information.
- The dynamic target language model further includes a tail part, which is used to confirm whether there is an additional intention, and the method further includes: calling the dynamic target language model to determine the additional intention, where the tail part of the dynamic target language model parses out the additional intention based on the keywords.
- The tail part includes tail marker words; calling the dynamic target language model to determine the additional intention, where the tail part of the dynamic target language model parses out the additional intention based on the keywords, includes: the tail part parsing out, based on the keywords, a reference tail marker word and the time point at which the reference tail marker word is located; updating the dynamic target language model based on the reference tail marker word, in combination with the first intention and the second intention, to obtain an updated target language model; and calling the updated target language model to parse out the additional intention based on the keywords and the time point at which the reference tail marker word is located.
- Before acquiring the voice signal, the method further includes: caching historical voice signals; parsing the voice signal to generate keywords includes: parsing the voice signal and using the historical voice signals for context detection to generate the keywords.
- the method further includes: confirming the second intention, and obtaining the confirmed second intention.
- Confirming the second intention and obtaining the confirmed second intention includes: sending confirmation information of the second intention to the user, obtaining the second intention fed back by the user, and using the second intention fed back by the user as the confirmed second intention.
- Obtaining or generating a dynamic target language model according to the reply information to the first intention includes: converting the reply information to the first intention into a reference format to obtain reply information in the reference format, and obtaining or generating the dynamic target language model according to the reply information in the reference format.
- Obtaining or generating the dynamic target language model according to the reply information in the reference format includes: converting the trained language model into a weighted finite state transition automaton model, and using the weighted finite state transition automaton model as the dynamic target language model, where the trained language model is obtained by training on the reply information in the reference format and the reference vocabulary.
- Obtaining or generating the dynamic target language model according to the reply information in the reference format includes: converting the trained language model into a weighted finite state transition automaton model, and using the weighted finite state transition automaton model as the first language model, where the trained language model is obtained by training on reply information in the reference format whose length is not less than the reference length; obtaining the second language model from reply information in the reference format whose length is less than the reference length, and obtaining the third language model from the reference vocabulary; and merging the first language model, the second language model and the third language model to obtain the overall language model, and using the overall language model as the dynamic target language model.
- Obtaining or generating the dynamic target language model according to the reply information in the reference format includes: obtaining a word confusion network based on reply information in the reference format whose length is not less than the reference length, where each word in the word confusion network has a transition probability; calculating the penalty weight of each word, converting the word confusion network into a weighted finite state transition automaton model according to the penalty weight of each word, and using the weighted finite state transition automaton model as the first language model; obtaining the second language model from reply information in the reference format whose length is less than the reference length, and obtaining the third language model from the reference vocabulary; and merging the first language model, the second language model and the third language model to obtain the overall language model, and using the overall language model as the dynamic target language model.
- calculating the penalty weight of each vocabulary includes: for any vocabulary, the negative logarithm value of the transition probability of the vocabulary is used as the penalty weight.
- calculating the penalty weight for each vocabulary includes: for any vocabulary, the logarithm of the number of items of the reply information in the reference format containing the vocabulary is used as the penalty weight.
- calculating the penalty weight of each vocabulary includes: for any vocabulary, the logarithm of the number of occurrences of the vocabulary in the reply message in the reference format is used as the penalty weight.
- An embodiment of the present application also provides a chip, including a processor configured to call, from a memory, and run instructions stored in the memory, so that a communication device on which the chip is installed executes any of the above-mentioned voice recognition methods.
- the embodiment of the present application also provides another chip, including: an input interface, an output interface, a processor, and a memory.
- the input interface, the output interface, the processor, and the memory are connected by an internal connection path.
- the processor is configured to execute the code in the memory, and when the code is executed, the processor is configured to execute any one of the aforementioned voice recognition methods.
- There may be one or more processors and one or more memories.
- the memory may be integrated with the processor, or the memory and the processor may be provided separately.
- the memory and the processor may be integrated on the same chip, or they may be separately arranged on different chips.
- the embodiment of the present application does not limit the type of the memory and the arrangement of the memory and the processor.
- the processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
- the general-purpose processor may be a microprocessor or any conventional processor. It is worth noting that the processor may be a processor that supports the advanced RISC machines (ARM) architecture.
- the foregoing memory may include a read-only memory and a random access memory, and provide instructions and data to the processor.
- the memory may also include non-volatile random access memory.
- the memory can also store device type information.
- the memory may be volatile memory or non-volatile memory, or may include both volatile and non-volatile memory.
- the non-volatile memory can be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM) or flash memory.
- the volatile memory may be random access memory (RAM), which is used as an external cache. By way of example and not limitation, many forms of RAM are available, such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchlink dynamic random access memory (SLDRAM), and direct rambus random access memory (direct rambus RAM).
- the embodiments of the present application provide a computer program.
- the processor or the computer can execute the corresponding steps and/or processes in the foregoing method embodiments.
- the computer program product includes one or more computer instructions.
- the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices.
- the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium. For example, the computer instructions may be transmitted from a website, computer, server, or data center.
- the computer-readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server or a data center integrated with one or more available media.
- the usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, and a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk).
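The three penalty-weight options described above, and the step of turning a word confusion network into weighted arcs, can be illustrated with a minimal Python sketch. The function names, the representation of reply information as lists of tokens, and the slot-by-slot layout of the confusion network are assumptions made for illustration only; they do not come from the application, and a real system would build the transducer with a dedicated toolkit.

```python
import math


def penalty_from_transition_probability(probability):
    """Option 1: negative logarithm of the word's transition probability."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("transition probability must lie in (0, 1]")
    return -math.log(probability)


def penalty_from_item_count(word, replies):
    """Option 2: logarithm of the number of reply items that contain the word."""
    item_count = sum(1 for reply in replies if word in reply)
    if item_count == 0:
        raise ValueError("word does not appear in any reply item")
    return math.log(item_count)


def penalty_from_occurrences(word, replies):
    """Option 3: logarithm of the word's total number of occurrences."""
    occurrences = sum(reply.count(word) for reply in replies)
    if occurrences == 0:
        raise ValueError("word does not appear in the reply information")
    return math.log(occurrences)


def confusion_network_to_arcs(network):
    """Turn a confusion network into weighted arcs (source, target, word, weight).

    `network` is a list of slots (hypothetical layout); each slot maps a
    candidate word to its transition probability. Slot i becomes the arcs
    from state i to state i + 1, weighted here with option 1.
    """
    arcs = []
    for state, slot in enumerate(network):
        for word, probability in slot.items():
            weight = penalty_from_transition_probability(probability)
            arcs.append((state, state + 1, word, weight))
    return arcs
```

Each reply item is passed as a list of tokens, for example `[["sichuan", "restaurant"], ["hunan", "restaurant"]]`. The arc list is only a stand-in for a weighted finite state transducer; merging it with the second and third language models is outside the scope of this sketch.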
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Machine Translation (AREA)
Claims (44)
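- A speech recognition method, characterized in that the method comprises: obtaining or generating a dynamic target language model according to reply information for a first intention, wherein the dynamic target language model comprises a front-end part and a core part, the core part is used to determine possible descriptions related to the reply information, and the front-end part is used to determine descriptions of confirmatory information for the reply information; obtaining a speech signal, and parsing the speech signal to generate keywords; and invoking the dynamic target language model to determine a second intention and service content, wherein the front-end part of the dynamic target language model parses out the second intention according to the keywords, and the core part of the dynamic target language model parses out the service content according to the keywords.
- The method according to claim 1, characterized in that the dynamic target language model further comprises a tail part, the tail part is used to confirm whether an additional intention exists, and the method further comprises: invoking the dynamic target language model to determine the additional intention, wherein the tail part of the dynamic target language model parses out the additional intention according to the keywords.
- The method according to claim 2, characterized in that the tail part comprises tail marker words; and the invoking the dynamic target language model to determine the additional intention, wherein the tail part of the dynamic target language model parses out the additional intention according to the keywords, comprises: parsing out, by the tail part according to the keywords, a reference tail marker word and the time point at which the reference tail marker word is located; updating the dynamic target language model based on the reference tail marker word in combination with the first intention and the second intention, to obtain an updated target language model; and invoking the updated target language model to parse out the additional intention according to the keywords and the time point at which the reference tail marker word is located.
- The method according to any one of claims 1 to 3, characterized in that before the obtaining a speech signal, the method further comprises: caching a historical speech signal; and the parsing the speech signal to generate keywords comprises: parsing the speech signal, and generating the keywords after performing context detection using the historical speech signal.
- The method according to any one of claims 1 to 4, characterized in that the obtaining or generating a dynamic target language model according to reply information for a first intention comprises: converting the reply information for the first intention into a reference format to obtain reply information in the reference format, and obtaining or generating the dynamic target language model according to the reply information in the reference format.
- The method according to claim 5, characterized in that the obtaining or generating the dynamic target language model according to the reply information in the reference format comprises: converting a trained language model into a weighted finite state transducer model, and using the weighted finite state transducer model as the dynamic target language model, wherein the trained language model is obtained by training on the reply information in the reference format and a reference vocabulary.
- The method according to claim 5, characterized in that the obtaining or generating the dynamic target language model according to the reply information in the reference format comprises: converting a trained language model into a weighted finite state transducer model, and using the weighted finite state transducer model as a first language model, wherein the trained language model is obtained by training on reply information in the reference format whose length is not less than a reference length; obtaining a second language model according to reply information in the reference format whose length is less than the reference length, and obtaining a third language model according to a reference vocabulary; and merging the first language model, the second language model and the third language model to obtain a total language model, and using the total language model as the dynamic target language model.
- The method according to claim 5, characterized in that the obtaining or generating the dynamic target language model according to the reply information in the reference format comprises: obtaining a word confusion network based on reply information in the reference format whose length is not less than a reference length, wherein each word in the word confusion network has a transition probability; calculating a penalty weight of each word, converting the word confusion network into a weighted finite state transducer model according to the penalty weight of each word, and using the weighted finite state transducer model as a first language model; obtaining a second language model according to reply information in the reference format whose length is less than the reference length, and obtaining a third language model according to a reference vocabulary; and merging the first language model, the second language model and the third language model to obtain a total language model, and using the total language model as the dynamic target language model.
- The method according to claim 8, characterized in that the calculating a penalty weight of each word comprises: for any word, using the negative logarithm of the transition probability of the word as the penalty weight.
- The method according to claim 8, characterized in that the calculating a penalty weight of each word comprises: for any word, using the logarithm of the number of items of reply information in the reference format that contain the word as the penalty weight.
- The method according to claim 8, characterized in that the calculating a penalty weight of each word comprises: for any word, using the logarithm of the number of occurrences of the word in the reply information in the reference format as the penalty weight.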
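- A speech recognition apparatus, characterized in that the apparatus comprises: a first obtaining module, configured to obtain or generate a dynamic target language model according to reply information for a first intention, wherein the dynamic target language model comprises a front-end part and a core part, the core part is used to determine possible descriptions related to the reply information, and the front-end part is used to determine descriptions of confirmatory information for the reply information; a second obtaining module, configured to obtain a speech signal and parse the speech signal to generate keywords; and a first determining module, configured to invoke the dynamic target language model to determine a second intention and service content, wherein the front-end part of the dynamic target language model parses out the second intention according to the keywords, and the core part of the dynamic target language model parses out the service content according to the keywords.
- The apparatus according to claim 12, characterized in that the dynamic target language model further comprises a tail part, the tail part is used to confirm whether an additional intention exists, and the apparatus further comprises: a second determining module, configured to invoke the dynamic target language model to determine the additional intention, wherein the tail part of the dynamic target language model parses out the additional intention according to the keywords.
- The apparatus according to claim 13, characterized in that the tail part comprises tail marker words; and the second determining module is configured to: parse out, by the tail part according to the keywords, a reference tail marker word and the time point at which the reference tail marker word is located; update the dynamic target language model based on the reference tail marker word in combination with the first intention and the second intention, to obtain an updated target language model; and invoke the updated target language model to parse out the additional intention according to the keywords and the time point at which the reference tail marker word is located.
- The apparatus according to any one of claims 12 to 14, characterized in that the apparatus further comprises: a caching module, configured to cache a historical speech signal; and the second obtaining module is configured to parse the speech signal and generate the keywords after performing context detection using the historical speech signal.
- The apparatus according to any one of claims 12 to 15, characterized in that the first obtaining module is configured to convert the reply information for the first intention into a reference format to obtain reply information in the reference format, and obtain or generate the dynamic target language model according to the reply information in the reference format.
- The apparatus according to claim 16, characterized in that the first obtaining module is configured to convert a trained language model into a weighted finite state transducer model, and use the weighted finite state transducer model as the dynamic target language model, wherein the trained language model is obtained by training on the reply information in the reference format and a reference vocabulary.
- The apparatus according to claim 16, characterized in that the first obtaining module is configured to: convert a trained language model into a weighted finite state transducer model, and use the weighted finite state transducer model as a first language model, wherein the trained language model is obtained by training on reply information in the reference format whose length is not less than a reference length; obtain a second language model according to reply information in the reference format whose length is less than the reference length, and obtain a third language model according to a reference vocabulary; and merge the first language model, the second language model and the third language model to obtain a total language model, and use the total language model as the dynamic target language model.
- The apparatus according to claim 16, characterized in that the first obtaining module comprises: a first obtaining unit, configured to obtain a word confusion network based on reply information in the reference format whose length is not less than a reference length, wherein each word in the word confusion network has a transition probability; a calculating unit, configured to calculate a penalty weight of each word, convert the word confusion network into a weighted finite state transducer model according to the penalty weight of each word, and use the weighted finite state transducer model as a first language model; a second obtaining unit, configured to obtain a second language model according to reply information in the reference format whose length is less than the reference length, and obtain a third language model according to a reference vocabulary; and a merging unit, configured to merge the first language model, the second language model and the third language model to obtain a total language model, and use the total language model as the dynamic target language model.
- The apparatus according to claim 19, characterized in that the calculating unit is configured to, for any word, use the negative logarithm of the transition probability of the word as the penalty weight.
- The apparatus according to claim 19, characterized in that the calculating unit is configured to, for any word, use the logarithm of the number of items of reply information in the reference format that contain the word as the penalty weight.
- The apparatus according to claim 19, characterized in that the calculating unit is configured to, for any word, use the logarithm of the number of occurrences of the word in the reply information in the reference format as the penalty weight.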
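- A speech recognition device, characterized in that the device comprises a memory and a processor; the memory stores at least one instruction, and the at least one instruction is loaded and executed by the processor to implement a speech recognition method, the method comprising: obtaining or generating a dynamic target language model according to reply information for a first intention, wherein the dynamic target language model comprises a front-end part and a core part, the core part is used to determine possible descriptions related to the reply information, and the front-end part is used to determine descriptions of confirmatory information for the reply information; obtaining a speech signal, and parsing the speech signal to generate keywords; and invoking the dynamic target language model to determine a second intention and service content, wherein the front-end part of the dynamic target language model parses out the second intention according to the keywords, and the core part of the dynamic target language model parses out the service content according to the keywords.
- The speech recognition device according to claim 23, characterized in that the dynamic target language model further comprises a tail part, the tail part is used to confirm whether an additional intention exists, and the method further comprises: invoking the dynamic target language model to determine the additional intention, wherein the tail part of the dynamic target language model parses out the additional intention according to the keywords.
- The speech recognition device according to claim 24, characterized in that the tail part comprises tail marker words; and the invoking the dynamic target language model to determine the additional intention, wherein the tail part of the dynamic target language model parses out the additional intention according to the keywords, comprises: parsing out, by the tail part according to the keywords, a reference tail marker word and the time point at which the reference tail marker word is located; updating the dynamic target language model based on the reference tail marker word in combination with the first intention and the second intention, to obtain an updated target language model; and invoking the updated target language model to parse out the additional intention according to the keywords and the time point at which the reference tail marker word is located.
- The speech recognition device according to any one of claims 23 to 25, characterized in that before the obtaining a speech signal, the method further comprises: caching a historical speech signal; and the parsing the speech signal to generate keywords comprises: parsing the speech signal, and generating the keywords after performing context detection using the historical speech signal.
- The speech recognition device according to any one of claims 23 to 26, characterized in that the obtaining or generating a dynamic target language model according to reply information for a first intention comprises: converting the reply information for the first intention into a reference format to obtain reply information in the reference format, and obtaining or generating the dynamic target language model according to the reply information in the reference format.
- The speech recognition device according to claim 27, characterized in that the obtaining or generating the dynamic target language model according to the reply information in the reference format comprises: converting a trained language model into a weighted finite state transducer model, and using the weighted finite state transducer model as the dynamic target language model, wherein the trained language model is obtained by training on the reply information in the reference format and a reference vocabulary.
- The speech recognition device according to claim 27, characterized in that the obtaining or generating the dynamic target language model according to the reply information in the reference format comprises: converting a trained language model into a weighted finite state transducer model, and using the weighted finite state transducer model as a first language model, wherein the trained language model is obtained by training on reply information in the reference format whose length is not less than a reference length; obtaining a second language model according to reply information in the reference format whose length is less than the reference length, and obtaining a third language model according to a reference vocabulary; and merging the first language model, the second language model and the third language model to obtain a total language model, and using the total language model as the dynamic target language model.
- The speech recognition device according to claim 27, characterized in that the obtaining or generating the dynamic target language model according to the reply information in the reference format comprises: obtaining a word confusion network based on reply information in the reference format whose length is not less than a reference length, wherein each word in the word confusion network has a transition probability; calculating a penalty weight of each word, converting the word confusion network into a weighted finite state transducer model according to the penalty weight of each word, and using the weighted finite state transducer model as a first language model; obtaining a second language model according to reply information in the reference format whose length is less than the reference length, and obtaining a third language model according to a reference vocabulary; and merging the first language model, the second language model and the third language model to obtain a total language model, and using the total language model as the dynamic target language model.
- The speech recognition device according to claim 30, characterized in that the calculating a penalty weight of each word comprises: for any word, using the negative logarithm of the transition probability of the word as the penalty weight.
- The speech recognition device according to claim 30, characterized in that the calculating a penalty weight of each word comprises: for any word, using the logarithm of the number of items of reply information in the reference format that contain the word as the penalty weight.
- The speech recognition device according to claim 30, characterized in that the calculating a penalty weight of each word comprises: for any word, using the logarithm of the number of occurrences of the word in the reply information in the reference format as the penalty weight.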
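- A computer-readable storage medium, characterized in that the computer-readable storage medium stores at least one instruction, and the instruction is loaded and executed by a processor to implement a speech recognition method, the method comprising: obtaining or generating a dynamic target language model according to reply information for a first intention, wherein the dynamic target language model comprises a front-end part and a core part, the core part is used to determine possible descriptions related to the reply information, and the front-end part is used to determine descriptions of confirmatory information for the reply information; obtaining a speech signal, and parsing the speech signal to generate keywords; and invoking the dynamic target language model to determine a second intention and service content, wherein the front-end part of the dynamic target language model parses out the second intention according to the keywords, and the core part of the dynamic target language model parses out the service content according to the keywords.
- The computer-readable storage medium according to claim 34, characterized in that the dynamic target language model further comprises a tail part, the tail part is used to confirm whether an additional intention exists, and the method further comprises: invoking the dynamic target language model to determine the additional intention, wherein the tail part of the dynamic target language model parses out the additional intention according to the keywords.
- The computer-readable storage medium according to claim 35, characterized in that the tail part comprises tail marker words; and the invoking the dynamic target language model to determine the additional intention, wherein the tail part of the dynamic target language model parses out the additional intention according to the keywords, comprises: parsing out, by the tail part according to the keywords, a reference tail marker word and the time point at which the reference tail marker word is located; updating the dynamic target language model based on the reference tail marker word in combination with the first intention and the second intention, to obtain an updated target language model; and invoking the updated target language model to parse out the additional intention according to the keywords and the time point at which the reference tail marker word is located.
- The computer-readable storage medium according to any one of claims 34 to 36, characterized in that before the obtaining a speech signal, the method further comprises: caching a historical speech signal; and the parsing the speech signal to generate keywords comprises: parsing the speech signal, and generating the keywords after performing context detection using the historical speech signal.
- The computer-readable storage medium according to any one of claims 34 to 37, characterized in that the obtaining or generating a dynamic target language model according to reply information for a first intention comprises: converting the reply information for the first intention into a reference format to obtain reply information in the reference format, and obtaining or generating the dynamic target language model according to the reply information in the reference format.
- The computer-readable storage medium according to claim 38, characterized in that the obtaining or generating the dynamic target language model according to the reply information in the reference format comprises: converting a trained language model into a weighted finite state transducer model, and using the weighted finite state transducer model as the dynamic target language model, wherein the trained language model is obtained by training on the reply information in the reference format and a reference vocabulary.
- The computer-readable storage medium according to claim 38, characterized in that the obtaining or generating the dynamic target language model according to the reply information in the reference format comprises: converting a trained language model into a weighted finite state transducer model, and using the weighted finite state transducer model as a first language model, wherein the trained language model is obtained by training on reply information in the reference format whose length is not less than a reference length; obtaining a second language model according to reply information in the reference format whose length is less than the reference length, and obtaining a third language model according to a reference vocabulary; and merging the first language model, the second language model and the third language model to obtain a total language model, and using the total language model as the dynamic target language model.
- The computer-readable storage medium according to claim 38, characterized in that the obtaining or generating the dynamic target language model according to the reply information in the reference format comprises: obtaining a word confusion network based on reply information in the reference format whose length is not less than a reference length, wherein each word in the word confusion network has a transition probability; calculating a penalty weight of each word, converting the word confusion network into a weighted finite state transducer model according to the penalty weight of each word, and using the weighted finite state transducer model as a first language model; obtaining a second language model according to reply information in the reference format whose length is less than the reference length, and obtaining a third language model according to a reference vocabulary; and merging the first language model, the second language model and the third language model to obtain a total language model, and using the total language model as the dynamic target language model.
- The computer-readable storage medium according to claim 41, characterized in that the calculating a penalty weight of each word comprises: for any word, using the negative logarithm of the transition probability of the word as the penalty weight.
- The computer-readable storage medium according to claim 41, characterized in that the calculating a penalty weight of each word comprises: for any word, using the logarithm of the number of items of reply information in the reference format that contain the word as the penalty weight.
- The computer-readable storage medium according to claim 41, characterized in that the calculating a penalty weight of each word comprises: for any word, using the logarithm of the number of occurrences of the word in the reply information in the reference format as the penalty weight.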
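The division of labour recited in the claims above, where a front-end part yields the second intention, a core part yields the service content, and a tail part spots a marker word that may introduce an additional intention, can be pictured with a toy Python sketch. Everything here is hypothetical: the class and field names, the use of plain dictionary and set lookups in place of a weighted finite state transducer, and the example phrases are illustrative assumptions rather than the claimed implementation. The model update that follows detection of a tail marker word is not modelled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParseResult:
    second_intention: Optional[str] = None
    service_content: Optional[str] = None
    tail_marker_index: Optional[int] = None  # stands in for the marker's time point


@dataclass
class ToyDynamicModel:
    confirmations: dict  # front-end part: confirmation phrase -> second intention
    candidates: set      # core part: descriptions derived from the reply information
    tail_markers: set    # tail part: marker words that may precede an additional intention

    def parse(self, keywords):
        """Scan the keywords once and fill in whatever each part recognizes."""
        result = ParseResult()
        for index, word in enumerate(keywords):
            if result.second_intention is None and word in self.confirmations:
                result.second_intention = self.confirmations[word]
            elif result.service_content is None and word in self.candidates:
                result.service_content = word
            elif word in self.tail_markers:
                # Remember where the marker occurred and stop; a fuller system
                # would update the model and parse the rest for the additional intention.
                result.tail_marker_index = index
                break
        return result


model = ToyDynamicModel(
    confirmations={"yes": "confirm", "no": "reject"},
    candidates={"restaurant A", "restaurant B"},
    tail_markers={"also"},
)
result = model.parse(["yes", "restaurant A", "also", "parking"])
```

In this example `result` carries the second intention `"confirm"`, the service content `"restaurant A"`, and the position of the tail marker word, after which the remaining keywords would be handed to the updated model.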
Priority Applications (4)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| EP20814489.9A EP3965101B1 (en) | 2019-05-31 | 2020-03-16 | Speech recognition method, apparatus and device, and computer-readable storage medium |
| JP2021570241A JP7343087B2 (ja) | 2019-05-31 | 2020-03-16 | Speech recognition method, apparatus, and device, and computer-readable storage medium |
| EP25192045.0A EP4682866A3 (en) | 2019-05-31 | 2020-03-16 | Speech recognition method, apparatus, and device, and computer-readable storage medium |
| US17/539,005 US12087289B2 (en) | 2019-05-31 | 2021-11-30 | Speech recognition method, apparatus, and device, and computer-readable storage medium |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201910470966.4A CN112017642B (zh) | 2019-05-31 | 2019-05-31 | Speech recognition method, apparatus, device, and computer-readable storage medium |
| CN201910470966.4 | 2019-05-31 |
Related Child Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US17/539,005 Continuation US12087289B2 (en) | 2019-05-31 | 2021-11-30 | Speech recognition method, apparatus, and device, and computer-readable storage medium |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2020238341A1 (zh) | 2020-12-03 |
Family
ID=73501103
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/CN2020/079522 Ceased WO2020238341A1 (zh) | 2019-05-31 | 2020-03-16 | Speech recognition method, apparatus, device, and computer-readable storage medium |
Country Status (5)
| Country | Link |
|---|---|
| US (1) | US12087289B2 (zh) |
| EP (2) | EP3965101B1 (zh) |
| JP (1) | JP7343087B2 (zh) |
| CN (2) | CN118379989A (zh) |
| WO (1) | WO2020238341A1 (zh) |
Families Citing this family (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN112331210B (zh) * | 2021-01-05 | 2021-05-18 | 太极计算机股份有限公司 | Speech recognition device |
| US11984125B2 (en) * | 2021-04-23 | 2024-05-14 | Cisco Technology, Inc. | Speech recognition using on-the-fly-constrained language model per utterance |
| US12300223B2 (en) * | 2022-01-04 | 2025-05-13 | Sap Se | Support for syntax analysis during processing instructions for execution |
| CN114882886B (zh) * | 2022-04-27 | 2024-10-01 | 卡斯柯信号有限公司 | CTC simulation training speech recognition processing method, storage medium and electronic device |
| CN115810359B (zh) * | 2022-09-28 | 2025-12-26 | 海尔优家智能科技(北京)有限公司 | Speech recognition method and apparatus, storage medium and electronic apparatus |
| CN117112065B (zh) * | 2023-08-30 | 2024-06-25 | 北京百度网讯科技有限公司 | Large model plug-in invocation method, apparatus, device and medium |
Citations (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101656799A (zh) * | 2008-08-20 | 2010-02-24 | 阿鲁策株式会社 | Automatic conversation system and conversation scenario editing device |
| US8990085B2 (en) * | 2009-09-30 | 2015-03-24 | At&T Intellectual Property I, L.P. | System and method for handling repeat queries due to wrong ASR output by modifying an acoustic, a language and a semantic model |
| CN105529030A (zh) * | 2015-12-29 | 2016-04-27 | 百度在线网络技术(北京)有限公司 | Speech recognition processing method and apparatus |
| CN105590626A (zh) * | 2015-12-29 | 2016-05-18 | 百度在线网络技术(北京)有限公司 | Continuous speech human-machine interaction method and system |
| CN105632495A (zh) * | 2015-12-30 | 2016-06-01 | 百度在线网络技术(北京)有限公司 | Speech recognition method and apparatus |
| CN106486120A (zh) * | 2016-10-21 | 2017-03-08 | 上海智臻智能网络科技股份有限公司 | Interactive voice response method and response system |
| CN109616108A (zh) * | 2018-11-29 | 2019-04-12 | 北京羽扇智信息科技有限公司 | Multi-round dialogue interaction processing method and apparatus, electronic device and storage medium |
Family Cites Families (28)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US5384892A (en) * | 1992-12-31 | 1995-01-24 | Apple Computer, Inc. | Dynamic language model for speech recognition |
| US6754626B2 (en) * | 2001-03-01 | 2004-06-22 | International Business Machines Corporation | Creating a hierarchical tree of language models for a dialog system based on prompt and dialog context |
| US7328155B2 (en) * | 2002-09-25 | 2008-02-05 | Toyota Infotechnology Center Co., Ltd. | Method and system for speech recognition using grammar weighted based upon location information |
| US20040148170A1 (en) * | 2003-01-23 | 2004-07-29 | Alejandro Acero | Statistical classifiers for spoken language understanding and command/control scenarios |
| JP3991914B2 (ja) * | 2003-05-08 | 2007-10-17 | 日産自動車株式会社 | Speech recognition device for a mobile body |
| JP2006023345A (ja) * | 2004-07-06 | 2006-01-26 | Alpine Electronics Inc | Method and apparatus for automatically capturing television images |
| US7228278B2 (en) * | 2004-07-06 | 2007-06-05 | Voxify, Inc. | Multi-slot dialog systems and methods |
| US7716056B2 (en) | 2004-09-27 | 2010-05-11 | Robert Bosch Corporation | Method and system for interactive conversational dialogue for cognitively overloaded device users |
| JP4846336B2 (ja) * | 2005-10-21 | 2011-12-28 | 株式会社ユニバーサルエンターテインメント | Conversation control device |
| US8239129B2 (en) | 2009-07-27 | 2012-08-07 | Robert Bosch Gmbh | Method and system for improving speech recognition accuracy by use of geographic information |
| KR20100012051A (ko) * | 2010-01-12 | 2010-02-04 | 주식회사 다날 | Star voice message listening system |
| US8938391B2 (en) * | 2011-06-12 | 2015-01-20 | Microsoft Corporation | Dynamically adding personalization features to language models for voice search |
| US9082403B2 (en) * | 2011-12-15 | 2015-07-14 | Microsoft Technology Licensing, Llc | Spoken utterance classification training for a speech recognition system |
| KR101759009B1 (ko) * | 2013-03-15 | 2017-07-17 | 애플 인크. | Training an at least partial voice command system |
| JP6280342B2 (ja) * | 2013-10-22 | 2018-02-14 | 株式会社Nttドコモ | Function execution instruction system and function execution instruction method |
| US9286892B2 (en) * | 2014-04-01 | 2016-03-15 | Google Inc. | Language modeling in speech recognition |
| US10460720B2 (en) * | 2015-01-03 | 2019-10-29 | Microsoft Technology Licensing, Llc. | Generation of language understanding systems and methods |
| WO2017112813A1 (en) * | 2015-12-22 | 2017-06-29 | Sri International | Multi-lingual virtual personal assistant |
| US10832664B2 (en) * | 2016-08-19 | 2020-11-10 | Google Llc | Automated speech recognition using language models that selectively use domain-specific model components |
| US10217458B2 (en) * | 2016-09-23 | 2019-02-26 | Intel Corporation | Technologies for improved keyword spotting |
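| CN106448670B (zh) * | 2016-10-21 | 2019-11-19 | 竹间智能科技(上海)有限公司 | Automatic reply dialogue system based on deep learning and reinforcement learning |
| CN107240394A (zh) * | 2017-06-14 | 2017-10-10 | 北京策腾教育科技有限公司 | Method and system using dynamic adaptive speech analysis technology for human-machine spoken language examinations |
| KR20190004495A (ko) * | 2017-07-04 | 2019-01-14 | 삼성에스디에스 주식회사 | Task processing method, apparatus and system using a chatbot |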
| US10083006B1 (en) * | 2017-09-12 | 2018-09-25 | Google Llc | Intercom-style communication using multiple computing devices |
| CN108735215A (zh) * | 2018-06-07 | 2018-11-02 | 爱驰汽车有限公司 | In-vehicle voice interaction system, method, device and storage medium |
| CN109003611B (zh) * | 2018-09-29 | 2022-05-27 | 阿波罗智联(北京)科技有限公司 | Method, apparatus, device and medium for vehicle voice control |
| US11004449B2 (en) * | 2018-11-29 | 2021-05-11 | International Business Machines Corporation | Vocal utterance based item inventory actions |
| US10997968B2 (en) * | 2019-04-30 | 2021-05-04 | Microsoft Technology Licensing, Llc | Using dialog context to improve language understanding |
2019
- 2019-05-31 CN CN202410408435.3A (CN118379989A, zh): Pending
- 2019-05-31 CN CN201910470966.4A (CN112017642B, zh): Active

2020
- 2020-03-16 WO PCT/CN2020/079522 (WO2020238341A1, zh): Ceased
- 2020-03-16 EP EP20814489.9A (EP3965101B1, en): Active
- 2020-03-16 EP EP25192045.0A (EP4682866A3, en): Pending
- 2020-03-16 JP JP2021570241A (JP7343087B2, ja): Active

2021
- 2021-11-30 US US17/539,005 (US12087289B2, en): Active
Also Published As
| Publication number | Publication date |
|---|---|
| EP3965101C0 (en) | 2025-08-27 |
| JP2022534242A (ja) | 2022-07-28 |
| US12087289B2 (en) | 2024-09-10 |
| EP4682866A2 (en) | 2026-01-21 |
| EP3965101A1 (en) | 2022-03-09 |
| CN112017642A (zh) | 2020-12-01 |
| EP3965101B1 (en) | 2025-08-27 |
| EP4682866A3 (en) | 2026-03-11 |
| CN118379989A (zh) | 2024-07-23 |
| JP7343087B2 (ja) | 2023-09-12 |
| US20220093087A1 (en) | 2022-03-24 |
| CN112017642B (zh) | 2024-04-26 |
| EP3965101A4 (en) | 2022-06-29 |
Similar Documents
| Publication | Title |
|---|---|
| US11887604B1 (en) | Speech interface device with caching component |
| CN112017642B (zh) | Speech recognition method, apparatus, device, and computer-readable storage medium |
| US11727917B1 (en) | Silent phonemes for tracking end of speech |
| US9905228B2 (en) | System and method of performing automatic speech recognition using local private data |
| US7689420B2 (en) | Personalizing a context-free grammar using a dictation language model |
| KR102201937B1 (ko) | Prediction of follow-up voice queries |
| CN102737096B (zh) | Location-based conversational understanding |
| US20070239453A1 (en) | Augmenting context-free grammars with back-off grammars for processing out-of-grammar utterances |
| US9986394B1 (en) | Voice-based messaging |
| US10838954B1 (en) | Identifying user content |
| US20180374478A1 (en) | Speech recognition method and device |
| US20240347055A1 (en) | Automatic synchronization for an offline virtual assistant |
| CN110956955A (zh) | Voice interaction method and apparatus |
| US10866948B2 (en) | Address book management apparatus using speech recognition, vehicle, system and method thereof |
| EP4325483B1 (en) | Speech interaction method, server, and storage medium |
| JP2018040904A (ja) | Speech recognition device and speech recognition method |
| JP2022121386A (ja) | Speaker diarization correction method and system utilizing text-based speaker change detection |
| US20240212687A1 (en) | Supplemental content output |
| CN116110396B (zh) | Voice interaction method, server and computer-readable storage medium |
| US11935533B1 (en) | Content-related actions based on context |
| CN112863496B (zh) | Voice endpoint detection method and apparatus |
| US10304454B2 (en) | Persistent training and pronunciation improvements through radio broadcast |
| US11277304B1 (en) | Wireless data protocol |
| JP2020012860A (ja) | Speech recognition device and speech recognition method |
| JP2019095606A (ja) | Training data generation method, training data generation program, and server |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 20814489; Country of ref document: EP; Kind code of ref document: A1 |
| | ENP | Entry into the national phase | Ref document number: 2021570241; Country of ref document: JP; Kind code of ref document: A |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | ENP | Entry into the national phase | Ref document number: 2020814489; Country of ref document: EP; Effective date: 20211202 |
| | WWG | WIPO information: grant in national office | Ref document number: 2020814489; Country of ref document: EP |