WO2021164244A1 - Voice interaction method, apparatus, device and computer storage medium - Google Patents

Voice interaction method, apparatus, device and computer storage medium

Info

Publication number
WO2021164244A1
Authority
WO
WIPO (PCT)
Prior art keywords
demand
user
sentence
voice
analysis
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2020/116018
Other languages
English (en)
French (fr)
Inventor
王海峰
黄际洲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Baidu Online Network Technology Beijing Co Ltd
Original Assignee
Baidu Online Network Technology Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Baidu Online Network Technology Beijing Co Ltd
Priority to KR1020217032708A (published as KR20210137531A, ko)
Priority to EP20864285.0A (published as EP3896690B1, en)
Priority to US17/279,540 (published as US11978447B2, en)
Priority to JP2021571465A (published as JP2022531987A, ja)
Publication of WO2021164244A1 (zh)
Legal status: Ceased (current)

Classifications

    • G PHYSICS
        • G10 MUSICAL INSTRUMENTS; ACOUSTICS
            • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
                • G10L15/00 Speech recognition
                    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
                        • G10L15/063 Training
                            • G10L2015/0631 Creating reference templates; Clustering
                    • G10L15/08 Speech classification or search
                        • G10L15/18 Speech classification or search using natural language modelling
                            • G10L15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
                                • G10L15/19 Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules
                                    • G10L15/197 Probabilistic grammars, e.g. word n-grams
                    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
                        • G10L2015/221 Announcement of recognition results
                        • G10L2015/223 Execution procedure of a spoken command
                        • G10L2015/225 Feedback of the input speech
                    • G10L15/26 Speech to text systems
        • G06 COMPUTING OR CALCULATING; COUNTING
            • G06F ELECTRIC DIGITAL DATA PROCESSING
                • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
                    • G06F3/16 Sound input; Sound output
                        • G06F3/167 Audio in a user interface, e.g. using voice commands for navigating, audio feedback
                • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
                    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
                        • G06F16/33 Querying
                            • G06F16/3331 Query processing
                                • G06F16/334 Query execution
                                    • G06F16/3344 Query execution using natural language analysis

Definitions

  • This application relates to the field of computer application technology, in particular to a voice interaction method, device, equipment and computer storage medium in the field of artificial intelligence.
  • With voice interaction technology, users can interact by voice with terminal devices such as smart speakers and smartphones.
  • In addition to the voice assistant that comes with the terminal device operating system, more and more applications are equipped with voice interaction technology. Users can obtain corresponding services by inputting voice commands, thereby freeing their hands to a large extent.
  • The present application provides a voice interaction method, apparatus, device, and computer storage medium, so as to improve user interaction efficiency and user experience.
  • this application provides a voice interaction method, which includes:
  • a demand analysis result corresponding to the demand expression determined by the user is used to perform a service response.
  • the method further includes:
  • the demand analysis result corresponding to the first voice command is used to perform a service response.
  • performing demand prediction on the first voice command to obtain at least one demand statement includes:
  • the first voice command is input into a pre-trained demand prediction model, and the demand prediction model maps the first voice command to at least one demand statement.
  • the demand prediction model is obtained by pre-training in the following manner:
  • obtaining training data, the training data including a plurality of sentence pairs, each sentence pair including a first sentence and a second sentence, wherein demand analysis can be performed successfully on the second sentence;
  • training a sequence-to-sequence (Seq2Seq) model using the training data to obtain the demand prediction model, wherein the first sentence in the sentence pair is used as the input of the Seq2Seq model, and the second sentence is used as the target output of the Seq2Seq model.
  • the training data is obtained from a text search log
  • the text search request (query) is used as the first sentence, and the clicked search result corresponding to the query is used to obtain the second sentence.
  • the first sentence and the second sentence form a sentence pair, and the confidence of the second sentence is determined by the number of clicks on the second sentence when the first sentence is used as the query.
  • returning at least one of the requirement expressions to the user in the form of inquiry includes:
  • the first requirement expression is returned to the user in the form of inquiry.
  • returning at least one of the requirement expressions to the user in the form of inquiry further includes:
  • the second requirement expression is returned to the user in the form of inquiry.
  • returning at least one of the requirement expressions to the user in the form of inquiry includes:
  • At least one demand statement with a confidence ranking among the top N demand statements obtained by mapping the demand prediction model is returned to the user in the form of inquiry, where N is a preset positive integer.
  • the method further includes:
  • the reason for the failure of the demand analysis is analyzed, and the reason for the failure of the demand analysis is further carried in the query.
  • the reasons for the failure of the requirement analysis include:
  • the environment is noisy, the length of the first voice command exceeds the limit, the pronunciation of the first voice command is inaccurate, or the first voice command is too colloquial.
  • the present application provides a voice interaction device, which includes:
  • the voice interaction unit is used to receive the first voice instruction input by the user
  • a voice processing unit configured to perform voice recognition and demand analysis on the first voice instruction
  • a demand prediction unit configured to, if the demand analysis fails, perform demand prediction on the first voice command to obtain at least one demand statement
  • the voice interaction unit is further configured to return at least one of the demand expressions to the user in the form of inquiry;
  • the service response unit is configured to, if the voice interaction unit receives a second voice instruction that the user determines at least one of the demand expressions, use the demand analysis result corresponding to the demand expression determined by the user to perform a service response.
  • the service response unit is further configured to, if the demand analysis succeeds, use the demand analysis result corresponding to the first voice command to perform a service response.
  • the demand prediction unit is specifically configured to input the first voice command into a pre-trained demand prediction model, and the demand prediction model maps the first voice command to at least one demand expression.
  • the device further includes:
  • the model training unit is used to obtain training data, the training data including a plurality of sentence pairs, each sentence pair including a first sentence and a second sentence, wherein demand analysis can be performed successfully on the second sentence; and to train a Seq2Seq model using the training data to obtain the demand prediction model, wherein the first sentence in the sentence pair is used as the input of the Seq2Seq model, and the second sentence is used as the target output of the Seq2Seq model.
  • when the voice interaction unit returns at least one of the demand expressions to the user in the form of inquiry, it specifically executes:
  • the first demand expression is returned to the user in the form of inquiry.
  • when the voice interaction unit returns at least one of the demand expressions to the user in the form of inquiry, it is also used to:
  • the second demand expression is returned to the user in the form of inquiry.
  • when the voice interaction unit returns at least one of the demand expressions to the user in the form of inquiry, it specifically executes:
  • At least one demand statement with a confidence ranking among the top N demand statements obtained by mapping the demand prediction model is returned to the user in the form of inquiry, where N is a preset positive integer.
  • the device further includes:
  • the reason analysis unit is configured to analyze the reason for the failure of the requirement analysis, and further carry the reason for the failure of the requirement analysis in the inquiry.
  • this application provides an electronic device, including:
  • at least one processor; and
  • a memory communicatively connected with the at least one processor; wherein,
  • the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor, so that the at least one processor can execute the method as described in any one of the above.
  • the present application provides a non-transitory computer-readable storage medium storing computer instructions, and the computer instructions are used to make the computer execute the method described in any one of the preceding items.
  • After failing to parse the demand of the voice command input by the user, this application further performs demand prediction on the voice command and "guesses" the user's possible demand expression, which is returned to the user for confirmation, rather than simply and rudely informing the user that what they said cannot be understood. This improves the user's interaction efficiency and the user experience.
  • FIG. 1 shows an exemplary system architecture of a voice interaction method or voice interaction device to which an embodiment of the present invention can be applied;
  • FIG. 2 is a flowchart of a voice interaction method provided by Embodiment 1 of this application;
  • FIG. 3 is a flowchart of a voice interaction method provided in Embodiment 2 of this application.
  • FIG. 4 is a flowchart of a voice interaction method provided in Embodiment 3 of this application.
  • FIG. 5 is a structural diagram of a voice interaction device provided in Embodiment 4 of this application.
  • Fig. 6 is a block diagram of an electronic device used to implement the voice interaction method of an embodiment of the present application.
  • Fig. 1 shows an exemplary system architecture of a voice interaction method or voice interaction device to which an embodiment of the present invention can be applied.
  • the system architecture may include terminal devices 101 and 102, a network 103 and a server 104.
  • the network 103 is used to provide a medium for communication links between the terminal devices 101 and 102 and the server 104.
  • the network 103 may include various connection types, such as wired, wireless communication links, or fiber optic cables, and so on.
  • the user can use the terminal devices 101 and 102 to interact with the server 104 through the network 103.
  • Various applications may be installed on the terminal devices 101 and 102, such as voice interactive applications, web browser applications, and communication applications.
  • the terminal devices 101 and 102 may be various electronic devices that support voice interaction, with or without screens, including but not limited to smartphones, tablets, smart speakers, smart TVs, etc.
  • the voice interaction device provided by the present invention can be set up and run in the aforementioned server 104, and can also be set up and run in terminal devices 101 and 102 with powerful processing functions. It can be implemented as multiple software or software modules (for example, to provide distributed services), or as a single software or software module, which is not specifically limited here.
  • the voice interaction device is installed and operated in the server 104, and the terminal device 101 sends the voice command input by the user to the server 104 through the network 103.
  • the server 104 performs processing using the method provided in the embodiment of the present invention and returns the processing result to the terminal device 101, which then presents it to the user, thereby realizing the voice interaction with the user.
  • the server 104 may be a single server or a server group composed of multiple servers. It should be understood that the numbers of terminal devices, networks, and servers in FIG. 1 are merely illustrative. There can be any number of terminal devices, networks, and servers according to implementation needs.
  • In existing voice interaction scenarios, after voice recognition and demand analysis are performed on the voice instruction input by the user, if the demand analysis fails, the user is either told that the request was not understood or prompted to rephrase the voice instruction.
  • For example, consider the following voice interaction scenario:
  • Voice assistant: XX (representing the name of the voice assistant, such as Baidu's Xiaodu, Xiaomi's Xiao Ai, Alibaba's Tmall Genie, etc.) does not understand what you said for now.
  • the voice assistant still fails to understand the user's needs after repeated attempts and changes, and the user easily loses patience.
  • This way of interaction is obviously very inefficient for users, and the user experience is also very poor.
  • The core idea of this application is that, after voice recognition and demand analysis are performed on the voice command input by the user, if the demand analysis fails, demand prediction is further performed on the voice command: the user's possible demand expression is "guessed" and returned to the user for confirmation, rather than simply and rudely telling the user that what they said cannot be understood.
  • the method provided in this application will be described in detail below in conjunction with embodiments.
  • Fig. 2 is a flowchart of the voice interaction method provided in the first embodiment of this application. As shown in Fig. 2, the method may include the following steps:
  • voice recognition and demand analysis are performed on the first voice instruction input by the user.
  • the first voice instruction may be the first voice instruction input by the user after waking up the voice assistant. It can also be a voice command entered in certain specific scenarios.
  • In this application, the terms "first", "second", "third", etc. used for voice commands, as in "first voice command", "second voice command" and "third voice command", do not impose restrictions on order, number or name; they are only used to distinguish different voice commands.
  • the demand prediction is performed on the first voice command to obtain at least one demand statement.
  • demand analysis may fail due to various reasons, that is, it is impossible to accurately obtain the user's demand type, structured information, etc.
  • this application does not simply and rudely inform the user that the analysis fails, but performs demand prediction on the first voice command, that is, guesses the user's demand, and returns at least one demand expression obtained by the prediction to the user.
  • At least one of the demand expressions is returned to the user in the form of inquiry.
  • a demand analysis result corresponding to the demand expression determined by the user is used to perform a service response.
  • Multiple implementation methods can be used. For example, multiple rounds of interaction can be used, returning one demand statement to the user per round in the form of a yes/no question. If the user confirms the demand statement, the demand analysis result corresponding to the confirmed demand statement is used to perform the service response. If the user answers in the negative, the next demand statement is returned in the next round, again as a yes/no question, and so on, until the preset maximum number of interaction rounds is reached. This method is described in detail in the second embodiment below.
  • Fig. 3 is a flowchart of the voice interaction method provided in the second embodiment of the application. As shown in Fig. 3, the method may include the following steps:
  • the demand analysis result corresponding to the first voice command is used to perform a service response, and this voice interaction process is ended.
  • the demand analysis result can be directly used for service response without multiple rounds of interaction.
  • the first voice command is input into a pre-trained demand prediction model, and the demand prediction model maps the first voice command to at least one demand statement, and takes the demand statement with the highest confidence level as the first demand statement.
  • a demand prediction model obtained by pre-training may be used to perform demand prediction.
  • the demand prediction model can map the first voice instruction to a variety of demand expressions close to the first voice instruction, so as to "guess" the user demand represented by the first voice instruction.
  • At least one demand statement obtained by the demand prediction model mapping has a confidence degree, and the confidence degree represents the accuracy with which the demand corresponding to the first voice command can be predicted.
  • a preferred implementation manner is introduced here for the training process of the demand prediction model.
  • In voice interaction, the number of search results returned will not be large. Usually, results for a need are returned only when the user's need is clear.
  • For text search, there is no limitation in this regard. As long as the user enters a text search request (query), a large number of search results are returned to the user, ranked by similarity. The user can find what they need among the many search results, click on it, and even go on to obtain vertical services.
  • the content of the text search log can be used as a basis to extract the demand expression corresponding to the user's query. That is, the training data is obtained from the text search log to train the demand prediction model.
  • the training process of the demand prediction model may include the following steps:
  • Step S1. Obtain training data.
  • the training data includes a large number of sentence pairs.
  • Each sentence pair includes two sentences: a first sentence and a second sentence.
  • The second sentence can be successfully parsed for demand; that is, the second sentence uses an expression from which the demand can be clearly determined by demand analysis.
  • Step S2. Use the training data to train a Seq2Seq (sequence-to-sequence) model to obtain the demand prediction model, where the first sentence in the sentence pair is used as the input of the Seq2Seq model, and the second sentence is used as the target output of the Seq2Seq model.
  • the training data can be obtained from the text search log first.
  • other methods may also be used, such as a method of artificially constructing training data. This application only takes the training data obtained from the text search log as an example for detailed description.
  • In the text search log, the text search request (query) input by the user can be used as the first sentence, and the clicked search result corresponding to the query can be used to obtain the second sentence.
  • The first sentence and the second sentence constitute a sentence pair, and the confidence of the second sentence is determined by the number of clicks on the second sentence when the first sentence is used as the query.
  • A search result clicked by the user can be regarded as meeting the user's needs to a certain extent. The more clicks a search result receives, the better it matches the user's needs.
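The click-based confidence described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the source only states that confidence is determined by the number of clicks, so normalizing by the total clicks for the query is an assumption, and `build_sentence_pairs` and the `(query, second_sentence)` log format are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def build_sentence_pairs(
    click_log: Iterable[Tuple[str, str]],
) -> Dict[str, List[Tuple[str, float]]]:
    """Map each query to (second sentence, confidence) pairs, highest first.

    click_log holds one (query, second_sentence) tuple per recorded click.
    Assumption: confidence = clicks on this second sentence for the query
    divided by all clicks recorded for the query.
    """
    clicks: Dict[str, Counter] = defaultdict(Counter)
    for query, second_sentence in click_log:
        clicks[query][second_sentence] += 1

    pairs: Dict[str, List[Tuple[str, float]]] = {}
    for query, counter in clicks.items():
        total = sum(counter.values())
        # most_common() yields the entries in descending click order.
        pairs[query] = [(s, n / total) for s, n in counter.most_common()]
    return pairs
```

Each resulting (first sentence, second sentence, confidence) triple could then serve as one training example.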
  • When the second sentence is obtained by using the clicked search result, the second sentence can be extracted from the title of the search result or obtained from the content of the search result.
  • The specific acquisition method may depend on the specific application. Take a map application as an example. Suppose that the user enters the text search request "a place nearby where you can breastfeed your child" in the map application, and the many search results returned to the user are all POIs (Points of Interest).
  • The category of the POI, or the category of the POI together with the attribute label of the POI, can be used to form the second sentence.
  • For example, the POI category of "Europe Shopping Center" is "shopping center" and its attribute tags include "mother and baby room", so "shopping center with a mother and baby room" can be obtained as the second sentence, which forms a sentence pair with the first sentence "a place nearby where you can breastfeed your child".
  • First sentence | Second sentence | Confidence
    A place nearby where you can breastfeed your child | Shopping center | 0.51
    A place nearby where you can breastfeed your child | Shopping malls with maternity rooms | 0.25
    A place nearby where you can breastfeed your child | Confinement center | 0.11
  • the second sentence consisting of the category of the POI and the attribute label of the POI is preferentially selected.
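As a rough sketch of how second sentences might be formed from a clicked POI, the function below lists the category-plus-attribute candidates first, reflecting the stated preference. The function name and the "{category} with {tag}" phrasing template are assumptions made for illustration; the source does not specify a template.

```python
from typing import List


def second_sentences_from_poi(category: str, attribute_tags: List[str]) -> List[str]:
    """Candidate second sentences for one clicked POI, preferred ones first.

    Candidates that combine the category with an attribute tag come before
    the bare category. The phrasing template is a hypothetical choice.
    """
    with_attributes = [f"{category} with {tag}" for tag in attribute_tags]
    return with_attributes + [category]
```

For a POI of category "shopping center" with the tag "mother and baby room", the first candidate is the combined expression and the bare category is the fallback.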
  • the obtained sentence pair and confidence can be used as training data, where the first sentence is used as the input of the seq2seq model, and the second sentence and its confidence are used as the output of the seq2seq model.
  • The seq2seq model is an encoder-decoder model whose input is a sequence and whose output is also a sequence.
  • the encoder turns a variable-length input sequence into a fixed-length vector, and the decoder decodes the fixed-length vector into a variable-length output sequence.
  • the maximum likelihood estimation method can be used during training.
  • the sentence pair determined in the above manner can be used as a positive sample, and the sentence pair formed by the query and the second sentence obtained from the search result that is not clicked can be used as a negative sample to train the seq2seq model.
  • the training objective is to maximize the difference between the confidence of the second sentence in the positive sample and the confidence of the second sentence in the negative sample.
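One common way to express "maximize the difference between positive and negative confidence" is a pairwise hinge (margin) loss. The sketch below uses that technique as a stand-in; the source does not name a specific loss function, and `pairwise_margin_loss` and its `margin` parameter are assumptions. It operates on plain confidence values rather than on a real seq2seq model.

```python
from typing import Sequence


def pairwise_margin_loss(
    positive_conf: Sequence[float],
    negative_conf: Sequence[float],
    margin: float = 1.0,
) -> float:
    """Mean hinge loss over aligned positive/negative confidence pairs.

    The loss for a pair is zero once the positive confidence exceeds the
    negative confidence by at least `margin`, so minimizing it pushes the
    two confidences apart.
    """
    if len(positive_conf) != len(negative_conf) or not positive_conf:
        raise ValueError("expected two non-empty sequences of equal length")
    losses = [
        max(0.0, margin - (p - n))
        for p, n in zip(positive_conf, negative_conf)
    ]
    return sum(losses) / len(losses)
```

In an actual training setup, the confidences would come from the model's scores for the positive and negative second sentences, and the loss would be minimized by gradient descent.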
  • Alternatively, a manually constructed demand mapping table may be used: a sentence is taken as the input, and at least one mapped sentence is obtained as the output by querying the demand mapping table.
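A demand mapping table of this kind can be sketched as a plain dictionary lookup. The table contents, the `lookup_demand_statements` name, and the whitespace and case normalization of the key are all illustrative assumptions.

```python
from typing import Dict, List

# Hypothetical, manually built table; the entries are illustrative only.
DEMAND_MAPPING_TABLE: Dict[str, List[str]] = {
    "a place nearby where you can breastfeed your child": [
        "shopping center with a mother and baby room",
        "confinement center",
    ],
}


def lookup_demand_statements(
    sentence: str,
    table: Dict[str, List[str]] = DEMAND_MAPPING_TABLE,
) -> List[str]:
    """Return the mapped sentences for `sentence`, or an empty list if none."""
    # Normalize case and whitespace so trivial variations still match.
    key = " ".join(sentence.lower().split())
    return list(table.get(key, []))
```

An exact-match table only covers sentences that were anticipated in advance, which is presumably why the trained model is presented as the preferred approach.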
  • the demand statement with the highest confidence level may be used as the first demand statement for returning to the user first.
  • The terms "first", "second", etc. used for demand statements, as in "first demand statement" and "second demand statement", do not impose restrictions on order, quantity or name; they are only used to distinguish different demand statements.
  • the demand forecasting model maps the first voice command to obtain the demand expressions and their corresponding confidence levels respectively as follows:
  • the first requirement expression is returned to the user in the form of inquiry.
  • The first demand statement can be returned to the user in the form of a yes/no question, so that the user only needs to give a simple spoken reply such as "yes/no" or "needed/not needed".
  • The reason for the demand analysis failure can be analyzed, and the above inquiry can further carry the reason why the demand analysis failed.
  • Analyzing the reason for the failure of demand analysis may include, but is not limited to, one or any combination of the following kinds of processing:
  • The first processing: during speech recognition, noise detection is performed on the background in which the user inputs the first voice command. Severe noise affects the speech recognition stage and causes the subsequent demand analysis to fail.
  • The second processing: pronunciation detection is performed during speech recognition to detect whether the user's pronunciation is accurate. Inaccurate pronunciation also affects the speech recognition stage and causes the subsequent demand analysis to fail.
  • The third processing: length detection is performed on the speech recognition result of the first voice command. Overly long sentences usually have an adverse effect on demand analysis. For example, the semantics of an overly long sentence are difficult to analyze during semantic analysis, which leads to the failure of demand analysis.
  • The fourth processing: colloquialism detection is performed on the speech recognition result of the first voice command. Sentences that are too colloquial adversely affect semantic analysis, leading to failure of demand analysis.
  • The reasons obtained for the failure of demand analysis may include, for example: a noisy environment, inaccurate pronunciation of the first voice command, the length of the first voice command exceeding the limit, the first voice command being too colloquial, the first voice command being too general, and so on.
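The four kinds of processing above could be combined roughly as follows. This is only a sketch: the `CommandSignals` fields, the score ranges, and every threshold are hypothetical, since the source does not say how the detections are computed. Counting whitespace-separated words is also an assumption that suits English text; a Chinese system would more likely count characters or segmented tokens.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CommandSignals:
    """Hypothetical detector outputs for one voice command."""
    noise_level: float          # 0..1, from background noise detection
    pronunciation_score: float  # 0..1, higher means clearer pronunciation
    text: str                   # speech recognition result
    colloquial_score: float     # 0..1, from colloquialism detection


def failure_reasons(
    signals: CommandSignals,
    max_noise: float = 0.6,          # assumed threshold
    min_pronunciation: float = 0.5,  # assumed threshold
    max_words: int = 20,             # assumed length limit
    max_colloquial: float = 0.7,     # assumed threshold
) -> List[str]:
    """Collect every reason whose detector crosses its threshold."""
    reasons: List[str] = []
    if signals.noise_level > max_noise:
        reasons.append("noisy environment")
    if signals.pronunciation_score < min_pronunciation:
        reasons.append("inaccurate pronunciation")
    if len(signals.text.split()) > max_words:
        reasons.append("command too long")
    if signals.colloquial_score > max_colloquial:
        reasons.append("command too colloquial")
    return reasons
```

The returned reasons could then be prepended to the inquiry sent back to the user.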
  • the voice assistant can return: "The sentence you entered is rather broad, and XX did not catch it clearly. Are you looking for the shopping center closest to you with a maternity room?"
  • the user only needs to make a positive or negative response to the first demand statement.
  • the demand analysis result of the first demand statement is:
  • the demand expression with the second highest confidence level obtained by the demand prediction model mapping is used as the second demand expression, and the second demand expression is returned to the user in the form of inquiry.
  • The user's voice command for the second demand expression is received. If a second voice command in which the user confirms the second demand expression is received, 309 is executed; if a third voice command in which the user denies the second demand expression is received, 310 is executed.
  • the preset maximum number of interaction rounds is reached (assuming that the preset maximum number of interaction rounds is 2 rounds), and a result of failure to understand the requirements is returned to the user.
  • The result of the demand understanding failure can be returned to the user, such as "I didn't understand your request". The user can also be prompted to re-enter the first voice command, such as "I didn't understand your request, please try a simpler statement."
  • Here, 2 rounds are used as the maximum number of interaction rounds. If more rounds are allowed, the demand expressions with the next highest confidence obtained by the demand prediction model mapping can continue to be returned to the user in the form of inquiry, until the user's confirmation is obtained or the preset maximum number of interaction rounds is reached.
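The round-by-round confirmation of the second embodiment can be sketched as a short loop. The function name, the `ask_yes_no` callback (standing in for speaking the question and interpreting the user's reply), and the question wording are assumptions.

```python
from typing import Callable, List, Optional, Tuple


def confirm_by_rounds(
    candidates: List[Tuple[str, float]],
    ask_yes_no: Callable[[str], bool],
    max_rounds: int = 2,
) -> Optional[str]:
    """Ask about one candidate per round, in descending confidence order.

    Returns the confirmed demand statement, or None when every asked
    candidate was denied or the round limit was reached.
    """
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    for statement, _confidence in ranked[:max_rounds]:
        if ask_yes_no(f"Are you looking for {statement}?"):
            return statement
    return None
```

A return value of None corresponds to the case where the assistant reports that it failed to understand the request.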
  • Voice assistant: The sentence you input is rather broad, and XX did not catch it clearly. Are you looking for a shopping mall with a maternity room near you?
  • the voice assistant returns information to the user about the nearest shopping mall with a maternity and baby room through a display or voice.
  • Fig. 4 is a flowchart of the voice interaction method provided in the third embodiment of the application. As shown in Fig. 4, the method may include the following steps:
  • the demand analysis result corresponding to the first voice command is used to perform a service response, and this voice interaction process is ended.
  • the first voice command is input into a pre-trained demand prediction model, the demand prediction model maps the first voice command to at least one demand expression, and the N demand expressions with the highest confidence are returned to the user in the form of an inquiry, where N is a preset positive integer.
  • the difference between the third embodiment and the second embodiment is that the demand expressions obtained by the demand prediction model mapping are not returned to the user one by one in each round of interaction; instead, N demand expressions are returned to the user together in the form of an alternative question for the user to choose from.
  • the reason for the demand analysis failure can also be analyzed, and the inquiry can further carry that reason. This part is similar to that in the second embodiment and is not repeated here.
  • taking N = 2 as an example, for the first voice command input by the user: "Can you help me find the nearest place where I can breastfeed my baby? The baby is hungry, please", the voice assistant can return: "The sentence you entered is rather broad, and XX did not hear it clearly. Are you looking for the nearest shopping center with a mother-and-baby room or the nearest postpartum care center?". In this way, the user only needs to reply "the former" or "the latter".
  • the user's voice command is received. If a voice command that the user confirms one of the demand expressions is received, 405 is executed; if a voice command that the user denies all demand expressions is received, 406 is executed.
  • a demand-understanding failure result can be returned to the user, such as "I didn't understand your demand". The user can also be prompted to re-enter the first voice command, such as "I didn't understand your demand, please try a simpler way of saying it."
  • Voice assistant: The sentence you entered is rather broad, and XX did not hear it clearly. Are you looking for the nearest shopping center with a mother-and-baby room or the nearest postpartum care center?
  • the voice assistant returns information to the user about the nearest shopping mall with a maternity and baby room through a display or voice.
  • FIG. 5 is a structural diagram of a voice interaction device provided in Embodiment 4 of the application.
  • the device may include: a voice interaction unit 01, a voice processing unit 02, a demand prediction unit 03, and a service response unit 04. It may further include a model training unit 05 and a cause analysis unit 06.
  • the main functions of each component are as follows:
  • the voice interaction unit 01 is responsible for receiving and transmitting data from the user and data returned to the user. First, the first voice instruction input by the user is received.
  • the voice processing unit 02 is responsible for speech recognition and demand analysis of the first voice command. For the first voice command input by the user, speech recognition is performed first. After the text produced by speech recognition is obtained, demand analysis is performed. The purpose of this demand analysis is mainly to understand the user's specific demand (also called the intent) and to obtain structured information, so that accurate services can be provided to the user. The specific demand analysis methods and analysis results can depend on the specific vertical service. This unit can use existing technology, which is not described further here.
  • the service response unit 04 uses the demand analysis result corresponding to the first voice command to perform a service response.
  • the demand prediction unit 03 performs demand prediction on the first voice command to obtain at least one demand statement. Then the voice interaction unit 01 returns at least one of the requirement expressions to the user in the form of inquiry.
  • the service response unit 04 uses the demand analysis result corresponding to the demand expression determined by the user to perform a service response.
  • the demand prediction unit 03 may input the first voice command into a pre-trained demand prediction model, and the demand prediction model maps the first voice command to at least one demand expression.
  • At least one demand statement obtained by the demand prediction model mapping has a confidence degree, and the confidence degree represents the accuracy with which the demand corresponding to the first voice command can be predicted.
  • the model training unit 05 is responsible for training to obtain the demand prediction model. Specifically, the model training unit 05 obtains training data.
  • the training data includes a plurality of sentence pairs.
  • the sentence pairs include a first sentence and a second sentence.
  • the second sentence can be successfully analyzed for demand; the training data is used to train the Seq2Seq model to obtain the demand prediction model, where the first sentence in the sentence pair is used as the input of the Seq2Seq model, and the second sentence is used as the target output of the Seq2Seq model.
  • the above-mentioned training data can be obtained from a text search log.
  • the query in the text search log can be used as the first sentence
  • the clicked search result corresponding to the query can be used to obtain the second sentence.
  • the first sentence and the second sentence form a sentence pair.
  • the number of clicks on the second sentence is determined. The higher the number of clicks, the higher the confidence level.
  • when the voice interaction unit returns at least one of the demand expressions to the user in the form of an inquiry, it can adopt, but is not limited to, the following two methods:
  • the first method: take the demand expression with the highest confidence among the at least one demand expression obtained by the demand prediction model mapping as the first demand expression, and return the first demand expression to the user in the form of an inquiry.
  • the demand expression with the second-highest confidence among the at least one demand expression obtained by the demand prediction model mapping is used as the second demand expression, and the second demand expression is returned to the user in the form of an inquiry.
  • the voice interaction unit 01 can return to the user the result of the failure to understand the requirements, or prompt the user to re-input the first voice instruction.
  • the N demand expressions can be returned to the user in the form of a selection question sentence for the user to choose.
  • the voice interaction unit 01 may return to the user a result that the requirement understanding fails, or may prompt the user to re-enter the first voice instruction.
  • the reason analysis unit 06 can analyze the reason for the failure of requirement analysis, and further carry the reason for the failure of requirement analysis in the inquiry.
  • analyzing the reason for the failure of demand analysis can include, but is not limited to, one or any combination of the following kinds of processing:
  • the first processing: during speech recognition, noise detection is performed on the background in which the user input the first voice command. Severe noise already has an impact at the speech recognition stage, which causes the subsequent demand analysis to fail.
  • the second processing: pronunciation detection is performed during speech recognition to check whether the user's pronunciation is accurate. Inaccurate pronunciation also has an impact at the speech recognition stage, which causes the subsequent demand analysis to fail.
  • the third processing: text-length detection is performed on the speech recognition result of the first voice command. Overly long sentences usually have an adverse effect on demand analysis. For example, it is difficult to work out the semantics of an overly long sentence during semantic analysis, which leads to the failure of demand analysis.
  • the fourth processing: colloquialism detection is performed on the speech recognition result of the first voice command. Sentences that are too colloquial adversely affect semantic analysis, leading to failure of demand analysis.
  • the resulting reasons for the failure of demand analysis can include, for example: a noisy environment, inaccurate pronunciation of the first voice command, the length of the first voice command exceeding the limit, the first voice command being too colloquial, and so on.
  • the present application also provides an electronic device and a readable storage medium.
  • FIG. 6 is a block diagram of an electronic device for the voice interaction method according to an embodiment of the present application.
  • Electronic devices are intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers.
  • Electronic devices can also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices.
  • the components shown herein, their connections and relationships, and their functions are merely examples, and are not intended to limit the implementation of the application described and/or required herein.
  • the electronic device includes one or more processors 601, a memory 602, and interfaces for connecting various components, including a high-speed interface and a low-speed interface.
  • the various components are connected to each other using different buses, and can be installed on a common motherboard or installed in other ways as needed.
  • the processor may process instructions executed in the electronic device, including instructions stored in or on the memory to display graphical information of the GUI on an external input/output device (such as a display device coupled to an interface).
  • in other embodiments, multiple processors and/or multiple buses can be used together with multiple memories, if necessary.
  • multiple electronic devices can be connected, and each device provides part of the necessary operations (for example, as a server array, a group of blade servers, or a multi-processor system).
  • a processor 601 is taken as an example.
  • the memory 602 is a non-transitory computer-readable storage medium provided by this application.
  • the memory stores instructions executable by at least one processor, so that the at least one processor executes the voice interaction method provided in this application.
  • the non-transitory computer-readable storage medium of the present application stores computer instructions, and the computer instructions are used to make a computer execute the voice interaction method provided by the present application.
  • the memory 602 can be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as program instructions/modules corresponding to the voice interaction method in the embodiments of the present application.
  • the processor 601 executes various functional applications and data processing of the server by running non-transient software programs, instructions, and modules stored in the memory 602, that is, realizing the voice interaction method in the foregoing method embodiment.
  • the memory 602 may include a program storage area and a data storage area.
  • the program storage area may store an operating system and an application program required by at least one function; the data storage area may store data created according to the use of the electronic device.
  • the memory 602 may include a high-speed random access memory, and may also include a non-transitory memory, such as at least one magnetic disk storage device, a flash memory device, or other non-transitory solid-state storage devices.
  • the memory 602 may optionally include memories remotely provided with respect to the processor 601, and these remote memories may be connected to the electronic device through a network. Examples of the aforementioned networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.
  • the electronic device of the voice interaction method may further include: an input device 603 and an output device 604.
  • the processor 601, the memory 602, the input device 603, and the output device 604 may be connected by a bus or in other ways. In FIG. 6, the connection by a bus is taken as an example.
  • the input device 603 can receive input digit or character information and generate key signal inputs related to user settings and function control of the electronic device. Examples of input devices include a touch screen, keypad, mouse, track pad, touch pad, pointing stick, one or more mouse buttons, trackball, and joystick.
  • the output device 604 may include a display device, an auxiliary lighting device (for example, LED), a tactile feedback device (for example, a vibration motor), and the like.
  • the display device may include, but is not limited to, a liquid crystal display (LCD), a light emitting diode (LED) display, and a plasma display. In some embodiments, the display device may be a touch screen.
  • Various implementations of the systems and techniques described herein can be realized in digital electronic circuit systems, integrated circuit systems, application-specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that can be executed and/or interpreted on a programmable system including at least one programmable processor. The programmable processor can be a dedicated or general-purpose programmable processor that can receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.
  • the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (for example, magnetic disks, optical disks, memory, programmable logic devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including machine-readable media that receive machine instructions as machine-readable signals.
  • machine-readable signal refers to any signal used to provide machine instructions and/or data to a programmable processor.
  • to provide interaction with the user, the systems and techniques described here can be implemented on a computer that has: a display device for displaying information to the user (for example, a CRT (cathode ray tube) or LCD (liquid crystal display) monitor); and a keyboard and a pointing device (for example, a mouse or a trackball) through which the user can provide input to the computer.
  • Other types of devices can also be used to provide interaction with the user; for example, the feedback provided to the user can be any form of sensory feedback (for example, visual feedback, auditory feedback, or tactile feedback), and input from the user can be received in any form (including acoustic input, voice input, or tactile input).
  • the systems and techniques described herein can be implemented in a computing system that includes back-end components (for example, a data server), or a computing system that includes middleware components (for example, an application server), or a computing system that includes front-end components (for example, a user computer with a graphical user interface or a web browser through which the user can interact with an implementation of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware, or front-end components.
  • the components of the system can be connected to each other through any form or medium of digital data communication (for example, a communication network). Examples of communication networks include: local area network (LAN), wide area network (WAN), and the Internet.
  • the computer system can include clients and servers.
  • the client and server are generally far away from each other and usually interact through a communication network.
  • the relationship between the client and the server is generated by computer programs that run on the corresponding computers and have a client-server relationship with each other.
  • the method, device, equipment and computer storage medium provided in this application can have the following advantages:
  • the reason for the failure of demand understanding will be analyzed while predicting the user's needs, and the analyzed reason will be returned to the user, so as to reduce the user's doubts and anxiety and further improve the user experience.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Artificial Intelligence (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Probability & Statistics with Applications (AREA)
  • General Health & Medical Sciences (AREA)
  • User Interface Of Digital Computer (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

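The present application discloses a voice interaction method, apparatus, device, and computer storage medium, and relates to the field of artificial intelligence. The specific implementation is: performing speech recognition and demand analysis on a first voice instruction input by a user; if the demand analysis fails, performing demand prediction on the first voice instruction to obtain at least one demand expression; returning at least one of the demand expressions to the user in the form of an inquiry; and, if a second voice instruction in which the user confirms at least one of the demand expressions is received, performing a service response using the demand analysis result corresponding to the demand expression confirmed by the user. The present application can effectively improve the user's interaction efficiency and enhance the user experience.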

Description

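Voice interaction method, apparatus, device, and computer storage medium
This application claims priority to Chinese patent application No. 2020100995744, filed on February 18, 2020 and entitled "Voice interaction method, apparatus, device, and computer storage medium".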
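Technical Field
The present application relates to the field of computer application technology, and in particular to a voice interaction method, apparatus, device, and computer storage medium in the field of artificial intelligence.
Background
This section is intended to provide background or context for the embodiments of the invention recited in the claims. The description here is not admitted to be prior art merely because it is included in this section.
With the continuous development of voice interaction technology, users can interact by voice with terminal devices such as smart speakers and smartphones. In addition to the voice assistants built into terminal operating systems, more and more applications incorporate voice interaction technology. Users can obtain the corresponding services by inputting voice instructions, which largely frees their hands.
In existing voice interaction scenarios, after speech recognition and demand understanding are performed on a voice instruction input by the user, if the user's demand cannot be understood well, one of the following two voice responses is returned to the user:
1) A result stating that the user's demand is not understood, for example "Sorry, I don't understand what you said for now".
2) A result prompting the user to change the voice instruction, for example "Sorry, please try a simpler way of saying it".
Either response gives the user a poor experience. The user feels that the voice assistant is not intelligent enough, and having to rephrase the voice instruction repeatedly makes the interaction very inefficient.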
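Summary
In view of this, the present application provides a voice interaction method, apparatus, device, and computer storage medium, so as to improve the user's interaction efficiency and enhance the user experience.
In a first aspect, the present application provides a voice interaction method, the method comprising:
performing speech recognition and demand analysis on a first voice instruction input by a user;
if the demand analysis fails, performing demand prediction on the first voice instruction to obtain at least one demand expression;
returning at least one of the demand expressions to the user in the form of an inquiry;
if a second voice instruction in which the user confirms at least one of the demand expressions is received, performing a service response using the demand analysis result corresponding to the demand expression confirmed by the user.
According to a preferred embodiment of the present application, the method further comprises:
if the demand analysis succeeds, performing a service response using the demand analysis result corresponding to the first voice instruction.
According to a preferred embodiment of the present application, performing demand prediction on the first voice instruction to obtain at least one demand expression comprises:
inputting the first voice instruction into a pre-trained demand prediction model, the demand prediction model mapping the first voice instruction to at least one demand expression.
According to a preferred embodiment of the present application, the demand prediction model is pre-trained in the following manner:
acquiring training data, the training data comprising a plurality of sentence pairs, each sentence pair comprising a first sentence and a second sentence, wherein the second sentence can be successfully analyzed for demand;
training a sequence-to-sequence (Seq2Seq) model using the training data to obtain the demand prediction model, wherein the first sentence of a sentence pair serves as the input of the Seq2Seq model and the second sentence serves as the target output of the Seq2Seq model.
According to a preferred embodiment of the present application, the training data is acquired from a text search log;
wherein a text search request (query) serves as the first sentence, the second sentence is obtained from a clicked search result corresponding to the query, the first sentence and the second sentence form a sentence pair, and the confidence of the second sentence is determined by the number of times the second sentence was clicked when the first sentence was the query.
According to a preferred embodiment of the present application, returning at least one of the demand expressions to the user in the form of an inquiry comprises:
taking the demand expression with the highest confidence among the at least one demand expression mapped by the demand prediction model as a first demand expression;
returning the first demand expression to the user in the form of an inquiry.
According to a preferred embodiment of the present application, returning at least one of the demand expressions to the user in the form of an inquiry further comprises:
if a third voice instruction in which the user rejects the first demand expression is received, taking the demand expression with the second-highest confidence among the at least one demand expression mapped by the demand prediction model as a second demand expression;
returning the second demand expression to the user in the form of an inquiry.
According to a preferred embodiment of the present application, returning at least one of the demand expressions to the user in the form of an inquiry comprises:
returning the N demand expressions with the highest confidence among the at least one demand expression mapped by the demand prediction model to the user in the form of an inquiry, N being a preset positive integer.
According to a preferred embodiment of the present application, the method further comprises:
analyzing the reason for the failure of the demand analysis, and further carrying that reason in the inquiry.
According to a preferred embodiment of the present application, the reason for the failure of the demand analysis comprises:
a noisy environment, the length of the first voice instruction exceeding a limit, inaccurate pronunciation of the first voice instruction, or the first voice instruction being colloquial.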
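In a second aspect, the present application provides a voice interaction apparatus, the apparatus comprising:
a voice interaction unit, configured to receive a first voice instruction input by a user;
a voice processing unit, configured to perform speech recognition and demand analysis on the first voice instruction;
a demand prediction unit, configured to perform demand prediction on the first voice instruction to obtain at least one demand expression if the demand analysis fails;
the voice interaction unit being further configured to return at least one of the demand expressions to the user in the form of an inquiry;
a service response unit, configured to perform a service response using the demand analysis result corresponding to the demand expression confirmed by the user, if the voice interaction unit receives a second voice instruction in which the user confirms at least one of the demand expressions.
According to a preferred embodiment of the present application, the service response unit is further configured to perform a service response using the demand analysis result corresponding to the first voice instruction if the demand analysis succeeds.
According to a preferred embodiment of the present application, the demand prediction unit is specifically configured to input the first voice instruction into a pre-trained demand prediction model, the demand prediction model mapping the first voice instruction to at least one demand expression.
According to a preferred embodiment of the present application, the apparatus further comprises:
a model training unit, configured to acquire training data, the training data comprising a plurality of sentence pairs, each sentence pair comprising a first sentence and a second sentence, wherein the second sentence can be successfully analyzed for demand; and to train a Seq2Seq model using the training data to obtain the demand prediction model, wherein the first sentence of a sentence pair serves as the input of the Seq2Seq model and the second sentence serves as the target output of the Seq2Seq model.
According to a preferred embodiment of the present application, when returning at least one of the demand expressions to the user in the form of an inquiry, the voice interaction unit specifically performs:
taking the demand expression with the highest confidence among the at least one demand expression mapped by the demand prediction model as a first demand expression;
returning the first demand expression to the user in the form of an inquiry.
According to a preferred embodiment of the present application, when returning at least one of the demand expressions to the user in the form of an inquiry, the voice interaction unit is further configured to:
if a third voice instruction in which the user rejects the first demand expression is received, take the demand expression with the second-highest confidence among the at least one demand expression mapped by the demand prediction model as a second demand expression;
return the second demand expression to the user in the form of an inquiry.
According to a preferred embodiment of the present application, when returning at least one of the demand expressions to the user in the form of an inquiry, the voice interaction unit specifically performs:
returning the N demand expressions with the highest confidence among the at least one demand expression mapped by the demand prediction model to the user in the form of an inquiry, N being a preset positive integer.
According to a preferred embodiment of the present application, the apparatus further comprises:
a reason analysis unit, configured to analyze the reason for the failure of the demand analysis and to further carry that reason in the inquiry.
In a third aspect, the present application provides an electronic device, comprising:
at least one processor; and
a memory communicatively connected to the at least one processor; wherein
the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method according to any one of the above.
In a fourth aspect, the present application provides a non-transitory computer-readable storage medium storing computer instructions, the computer instructions being used to cause the computer to perform the method according to any one of the above.
As can be seen from the above technical solutions, after demand analysis of the voice instruction input by the user fails, the present application further performs demand prediction on the voice instruction and "guesses" the user's possible demand expressions, returning them to the user for confirmation, rather than bluntly telling the user that what they said cannot be understood. This improves the user's interaction efficiency and enhances the user experience.
Other effects of the above optional implementations are described below in conjunction with specific embodiments.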
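Brief Description of the Drawings
The drawings are provided for a better understanding of the solution and do not limit the present application. In the drawings:
Fig. 1 shows an exemplary system architecture to which the voice interaction method or voice interaction apparatus of an embodiment of the present invention can be applied;
Fig. 2 is a flowchart of the voice interaction method provided in Embodiment 1 of the present application;
Fig. 3 is a flowchart of the voice interaction method provided in Embodiment 2 of the present application;
Fig. 4 is a flowchart of the voice interaction method provided in Embodiment 3 of the present application;
Fig. 5 is a structural diagram of the voice interaction apparatus provided in Embodiment 4 of the present application;
Fig. 6 is a block diagram of an electronic device used to implement the voice interaction method of the embodiments of the present application.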
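Detailed Description
Exemplary embodiments of the present application are described below with reference to the drawings, including various details of the embodiments to aid understanding, which should be regarded as merely exemplary. Those of ordinary skill in the art should therefore recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present application. Likewise, for clarity and conciseness, descriptions of well-known functions and structures are omitted from the following description.
Fig. 1 shows an exemplary system architecture to which the voice interaction method or voice interaction apparatus of an embodiment of the present invention can be applied.
As shown in Fig. 1, the system architecture may include terminal devices 101 and 102, a network 103, and a server 104. The network 103 is the medium that provides communication links between the terminal devices 101, 102 and the server 104. The network 103 may include various connection types, such as wired or wireless communication links, or fiber-optic cables.
A user can use the terminal devices 101 and 102 to interact with the server 104 through the network 103. Various applications may be installed on the terminal devices 101 and 102, such as voice interaction applications, web browser applications, and communication applications.
The terminal devices 101 and 102 may be any electronic devices that support voice interaction, with or without a screen, including but not limited to smartphones, tablet computers, smart speakers, and smart TVs. The voice interaction apparatus provided by the present invention may be deployed and run in the server 104, or in terminal devices 101 and 102 with strong processing capability. It may be implemented as multiple pieces of software or software modules (for example, to provide distributed services) or as a single piece of software or software module, which is not specifically limited here.
For example, when the voice interaction apparatus is deployed and run in the server 104, the terminal device 101 sends the voice instruction input by the user to the server 104 through the network 103. After processing it with the method provided by the embodiments of the present invention, the server 104 returns the processing result to the terminal device 101, which then presents it to the user, thereby realizing voice interaction with the user.
The server 104 may be a single server or a server group composed of multiple servers. It should be understood that the numbers of terminal devices, networks, and servers in Fig. 1 are merely illustrative. There may be any number of terminal devices, networks, and servers according to implementation needs.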
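In traditional voice interaction scenarios, after speech recognition and demand analysis are performed on the voice instruction input by the user, if the demand analysis fails, the user receives either a result stating that the demand is not understood or a result prompting the user to change the voice instruction. Consider the following voice interaction scenarios:
Scenario 1:
User: Can you help me find the nearest place where I can breastfeed my baby? The baby is hungry, please.
Voice assistant: Sorry, XX (denoting the name of the voice assistant, for example Baidu's Xiaodu, Xiaomi's Xiao AI, or Alibaba's Tmall Genie) doesn't understand what you said for now.
In this scenario, the user will think the voice assistant is really dumb and not intelligent at all.
Scenario 2:
User: Can you help me find the nearest place where I can breastfeed my baby? The baby is hungry, please.
Voice assistant: Sorry, XX doesn't understand what you said for now. Please try a simpler way of saying it.
User: I need to find the nearest place where I can breastfeed my baby.
Voice assistant: Sorry, XX doesn't understand what you said for now. Please try a simpler way of saying it.
In this scenario, the voice assistant still cannot understand the user's demand after several attempts and rephrasings, and the user easily loses patience. This kind of interaction is clearly very inefficient for the user, and the user experience is very poor.
In view of this, the core idea of the present application is that when demand analysis fails after speech recognition and demand analysis of the user's voice instruction, demand prediction is further performed on the voice instruction to "guess" the user's possible demand expressions and return them to the user for confirmation, rather than bluntly telling the user that what they said cannot be understood. The method provided by the present application is described in detail below in conjunction with embodiments.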
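Embodiment 1
Fig. 2 is a flowchart of the voice interaction method provided in Embodiment 1 of the present application. As shown in Fig. 2, the method may include the following steps:
In 201, speech recognition and demand analysis are performed on a first voice instruction input by the user.
The first voice instruction may be the first voice instruction input after the user wakes up the voice assistant. It may also be a voice instruction input in certain specific scenarios.
In the present application, terms such as "first", "second", and "third" applied to voice instructions, for example "first voice instruction", "second voice instruction", and "third voice instruction", impose no limitation on order, quantity, or name and are used only to distinguish different voice instructions.
For the first voice instruction input by the user, speech recognition is performed first. After the text produced by speech recognition is obtained, demand analysis is performed. The purpose of the demand analysis is mainly to understand the user's specific demand (also called the intent) and to obtain structured information, so that accurate services can be provided to the user. The specific demand analysis methods and analysis results can depend on the specific vertical service. This part can use existing technology and is not described further here. Just one example is given:
Suppose the user inputs the first voice instruction "Please help me plan a route from Xi'erqi via Nanluoguxiang to Beijing Railway Station that avoids traffic jams". After speech recognition and demand analysis, the analysis result is:
"Intent: route planning
Travel mode: driving
Origin: Xi'erqi
Destination: Beijing Railway Station
Waypoint: Nanluoguxiang
Filter condition: no traffic jams".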
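As an illustration only, the structured analysis result above could be held in a small record type. The class name `DemandAnalysisResult` and its field names are assumptions made for this sketch; the application does not prescribe any particular data structure.

```python
from dataclasses import dataclass, field


@dataclass
class DemandAnalysisResult:
    """Hypothetical container for the structured output of demand analysis."""
    intent: str                                        # e.g. "route_planning"
    slots: dict[str, str] = field(default_factory=dict)    # named values
    filters: list[str] = field(default_factory=list)       # filter conditions


# The route-planning example from the text, expressed with the assumed fields.
route_request = DemandAnalysisResult(
    intent="route_planning",
    slots={
        "travel_mode": "driving",
        "origin": "Xi'erqi",
        "destination": "Beijing Railway Station",
        "waypoint": "Nanluoguxiang",
    },
    filters=["no_traffic_jams"],
)
```

A downstream vertical service would read `intent` to choose a handler and `slots` and `filters` to parameterize it.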
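In 202, if the demand analysis fails, demand prediction is performed on the first voice instruction to obtain at least one demand expression.
During demand analysis, various causes may lead to failure, meaning that the user's demand type, structured information, and so on cannot be obtained accurately. In this case, the present application does not bluntly tell the user that the analysis failed. Instead, it performs demand prediction on the first voice instruction, that is, it guesses the user's demand, and returns at least one predicted demand expression to the user.
In 203, at least one of the demand expressions is returned to the user in the form of an inquiry.
In 204, if a second voice instruction in which the user confirms at least one of the demand expressions is received, a service response is performed using the demand analysis result corresponding to the demand expression confirmed by the user.
Multiple implementations are possible in the present application. For example, multi-round interaction can be used, returning one demand expression to the user at a time and asking the user with a yes/no question. If one of the demand expressions is confirmed, a service response is performed using the demand analysis result corresponding to the confirmed demand expression. If the answer is negative, another demand expression is returned to the user with a yes/no question in the next round, and so on, until the preset maximum number of interaction rounds is reached. This approach is described in detail in Embodiment 2.
As another example, multiple demand expressions can be returned to the user at once, and the user is asked with an alternative question. If the user selects one of the demand expressions, a service response is performed according to the demand analysis result corresponding to the selected demand expression. This approach is described in detail in Embodiment 3.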
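The control flow of steps 201 to 204 can be sketched as follows. This is a minimal sketch, not the applicant's implementation: `handle_instruction` and all of the `toy_*` helpers are hypothetical stand-ins, and the rule that any sentence of 60 or more characters fails analysis exists only to make the example runnable.

```python
from typing import Callable, Optional

FAILURE_REPLY = "Sorry, I did not understand your demand."


def handle_instruction(
    text: str,
    parse: Callable[[str], Optional[dict]],
    predict: Callable[[str], list],
    ask_user: Callable[[list], Optional[str]],
    respond: Callable[[dict], str],
) -> str:
    """Parse; on failure, predict demand expressions, ask, then respond."""
    result = parse(text)                  # step 201: demand analysis
    if result is not None:
        return respond(result)            # analysis succeeded
    candidates = predict(text)            # step 202: demand prediction
    confirmed = ask_user(candidates)      # step 203: inquiry to the user
    if confirmed is None:
        return FAILURE_REPLY
    confirmed_result = parse(confirmed)   # step 204: analyze confirmed text
    if confirmed_result is None:
        return FAILURE_REPLY
    return respond(confirmed_result)


# Toy stand-ins, purely illustrative.
def toy_parse(text: str) -> Optional[dict]:
    # Pretend that long sentences cannot be analyzed.
    return {"intent": "search", "query": text} if len(text) < 60 else None


def toy_predict(text: str) -> list:
    return [("shopping center with a mother-and-baby room", 0.92),
            ("postpartum care center", 0.81)]


def toy_ask_user(candidates: list) -> Optional[str]:
    return candidates[0][0]  # pretend the user confirms the top candidate


def toy_respond(result: dict) -> str:
    return f"Searching for: {result['query']}"
```

Passing the analysis, prediction, and inquiry steps in as callables keeps the flow independent of how each step is realized, which mirrors the text's statement that demand analysis itself can use existing technology.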
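Embodiment 2
Fig. 3 is a flowchart of the voice interaction method provided in Embodiment 2 of the present application. As shown in Fig. 3, the method may include the following steps:
In 301, speech recognition and demand analysis are performed on the first voice instruction input by the user. If the demand analysis succeeds, 302 is executed; if the demand analysis fails, 303 is executed.
In 302, a service response is performed using the demand analysis result corresponding to the first voice instruction, and this voice interaction flow ends.
If demand analysis of the first voice instruction input by the user succeeds directly, the demand analysis result is simply used for the service response, and no multi-round interaction is needed.
In 303, the first voice instruction is input into a pre-trained demand prediction model, the demand prediction model maps the first voice instruction to at least one demand expression, and the demand expression with the highest confidence is taken as the first demand expression.
If demand analysis of the first voice instruction input by the user fails, demand prediction is performed on the first voice instruction. In the embodiments of the present application, a pre-trained demand prediction model can be used for demand prediction. The demand prediction model can map the first voice instruction to multiple demand expressions close to the first voice instruction, thereby "guessing" the user demand represented by the first voice instruction. Each of the demand expressions mapped by the demand prediction model has a confidence, which represents how accurately it predicts the demand corresponding to the first voice instruction.
For ease of understanding, a preferred implementation of the training process of the demand prediction model is introduced here. Voice interaction has inherent limitations: on the one hand, it cannot return a wide variety of search results to the user in a very generalized way; on the other hand, the number of returned search results cannot be large, and usually only a few search results for the corresponding demand are returned once the user's demand is clear. Text search has no such limitation. As long as the user inputs a text search request (query), a large number of search results ranked by similarity are returned, and the user can look through them for the needed content, click on it, and even go on to obtain a vertical service. In the embodiments of the present application, the content of text search logs can therefore be used as the basis for extracting the demand expressions corresponding to users' queries. That is, training data is obtained from text search logs to train the demand prediction model.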
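Specifically, the training process of the demand prediction model may include the following steps:
Step S1: acquire training data. The training data includes a large number of sentence pairs, and each sentence pair includes two sentences: a first sentence and a second sentence. The second sentence can be successfully analyzed for demand; in other words, the second sentence uses an expression whose demand is clear after demand analysis.
Step S2: train a sequence-to-sequence (Seq2Seq) model using the training data to obtain the demand prediction model, where the first sentence of a sentence pair serves as the input of the Seq2Seq model and the second sentence serves as the target output of the Seq2Seq model.
As a preferred implementation, the training data can first be obtained from text search logs. Of course, besides text search logs, other approaches can also be used, for example manually constructing the training data. The present application describes in detail only the example of obtaining training data from text search logs.
In a text search log, the text search request input by the user, that is, the query, can serve as the first sentence, and the second sentence is obtained from the clicked search result corresponding to the query. The first sentence and the second sentence form a sentence pair, and the confidence of the second sentence is determined by the number of times the second sentence was clicked when the first sentence was the query.
When performing a text search, a user usually inputs a query and then looks through a large number of search results for the one needed. A search result clicked by the user can therefore be regarded as matching the user's demand to some extent. Moreover, the more clicks a search result receives, the better it matches the user's demand. Furthermore, if the user requested and obtained a service related to the search result, the search result matches the user's demand even better.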
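When the second sentence is obtained from a clicked search result, it can be extracted from the title of the search result or from the content of the search result. The specific way of obtaining it can depend on the specific application. Take a map application as an example. Suppose the user inputs the text search request "a place nearby where I can breastfeed my baby" in the map application, and the many search results returned to the user are all POIs (Points of Interest). For example, the returned POIs include:
POI1: 妈妈爱萌宝母婴用品店 (a mother-and-baby supplies store)
POI2: 贝爱贝亲月子中心 (a postpartum care center)
POI3: 欧美汇购物中心 (a shopping center)
Suppose the user clicks POI3. The second sentence can be formed from the category of the POI, or from the category of the POI together with its attribute tags. For example, the POI category of "欧美汇购物中心" is "shopping center" and its attribute tags include "has a mother-and-baby room", so "shopping center with a mother-and-baby room" can be obtained as the second sentence, which forms a sentence pair with the first sentence "a place nearby where I can breastfeed my baby".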
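In a similar way, suppose the following sentence pairs and confidences can be obtained:
First sentence                                   Second sentence                                 Confidence
a place nearby where I can breastfeed my baby    shopping center                                 0.51
a place nearby where I can breastfeed my baby    shopping center with a mother-and-baby room     0.25
a place nearby where I can breastfeed my baby    postpartum care center                          0.11
a place nearby where I can breastfeed my baby    mother-and-baby store                           0.09
a place nearby where I can breastfeed my baby    shopping center with charging piles             0.04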
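After sentence pairs whose confidence is below a preset confidence threshold are filtered out, second sentences formed from the category of the POI together with its attribute tags are preferred. The resulting sentence pairs and confidences can be used as training data, where the first sentence serves as the input of the seq2seq model and the second sentence and its confidence serve as the output of the seq2seq model. The seq2seq model is an encoder-decoder architecture whose input is a sequence and whose output is also a sequence. The encoder turns a variable-length input sequence into a fixed-length vector, and the decoder decodes this fixed-length vector into a variable-length output sequence. Maximum likelihood estimation can be used during training.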
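The derivation of sentence pairs and confidences from click logs can be sketched as below. The text only says that confidence is determined by the number of clicks; taking it to be the share of clicks a second sentence received for a given query is an assumption of this sketch, as are the function name `build_sentence_pairs`, the log format, and the default threshold.

```python
from collections import Counter, defaultdict


def build_sentence_pairs(click_log, min_confidence=0.1):
    """Turn (query, second_sentence) click records into training triples.

    click_log: iterable of (query, second_sentence), one record per click.
    Returns a list of (first_sentence, second_sentence, confidence), where
    confidence is assumed to be the click share within the same query.
    """
    clicks = defaultdict(Counter)
    for query, second_sentence in click_log:
        clicks[query][second_sentence] += 1

    pairs = []
    for query, counter in clicks.items():
        total = sum(counter.values())
        for second_sentence, count in counter.most_common():
            confidence = count / total
            if confidence >= min_confidence:   # drop low-confidence pairs
                pairs.append((query, second_sentence, confidence))
    return pairs


# A made-up log whose click shares reproduce the example table.
QUERY = "a place nearby where I can breastfeed my baby"
example_log = (
    [(QUERY, "shopping center")] * 51
    + [(QUERY, "shopping center with a mother-and-baby room")] * 25
    + [(QUERY, "postpartum care center")] * 11
    + [(QUERY, "mother-and-baby store")] * 9
    + [(QUERY, "shopping center with charging piles")] * 4
)
training_pairs = build_sentence_pairs(example_log, min_confidence=0.1)
```

With a threshold of 0.1, the two rarely clicked rows are filtered out. The further preference for second sentences built from a POI category plus attribute tags is not modeled here.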
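In addition, the sentence pairs determined in the above way can be used as positive samples, and sentence pairs formed from the query and second sentences obtained from unclicked search results can be used as negative samples, to train the seq2seq model. The training objective is to maximize the difference between the confidence of the second sentence in the positive samples and the confidence of the second sentence in the negative samples.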
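One possible way to write this objective down is given below. The notation is an assumption introduced here for illustration: $x$ is the first sentence (the query), $y^{+}$ and $y^{-}$ are second sentences from a clicked and an unclicked result, $c_\theta(y \mid x)$ is the confidence the model with parameters $\theta$ assigns to $y$ given $x$, and $\mathcal{D}$ is the set of such triples. The application states the objective only in words.

```latex
\theta^{*} \;=\; \arg\max_{\theta} \sum_{(x,\, y^{+},\, y^{-}) \in \mathcal{D}}
  \Bigl[\, c_\theta\!\left(y^{+} \mid x\right) \;-\; c_\theta\!\left(y^{-} \mid x\right) \Bigr]
```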
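Of course, besides training a seq2seq model to obtain the demand prediction model, the demand prediction model can also be realized in other ways. For example, a demand mapping table can be constructed manually: a sentence is taken as input, and at least one mapped sentence is obtained as output by looking up the demand mapping table.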
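The mapping-table alternative amounts to a dictionary lookup, as in the following sketch. The table contents, the name `lookup_demands`, and the whitespace and case normalization are all hypothetical; a real table would need its own matching rules.

```python
# Hand-built demand mapping table: input sentence -> mapped demand expressions.
DEMAND_MAP = {
    "a place nearby where i can breastfeed my baby": [
        "shopping center with a mother-and-baby room",
        "postpartum care center",
        "mother-and-baby store",
    ],
}


def lookup_demands(sentence, table=DEMAND_MAP):
    """Return the mapped demand expressions, or an empty list if unknown."""
    key = sentence.strip().lower()   # assumed normalization, for illustration
    return list(table.get(key, []))
```

Unlike a trained model, an exact-match table cannot generalize to unseen phrasings, which is the trade-off against the seq2seq approach.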
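Among the at least one demand expression mapped by the demand prediction model, this embodiment can take the demand expression with the highest confidence as the first demand expression, to be returned to the user first. In the present application, terms such as "first" and "second" applied to demand expressions, for example "first demand expression" and "second demand expression", impose no limitation on order, quantity, or name and are used only to distinguish different demand expressions.
Continuing with the example in which the user inputs the first voice instruction "Can you help me find the nearest place where I can breastfeed my baby? The baby is hungry, please": after demand analysis fails for one or more reasons, the instruction is input into the demand prediction model, and the demand expressions obtained by mapping the first voice instruction, with their confidences, are as follows:
shopping center with a mother-and-baby room    0.92
postpartum care center                         0.81
mother-and-baby store                          0.68
...
"shopping center with a mother-and-baby room" is combined with a preset template to obtain "find the nearest shopping center with a mother-and-baby room" as the first demand expression. Here "find the nearest" is the preset template, whose purpose is to make the demand expression smoother and more natural as speech, but omitting the preset template is also acceptable.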
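In 304, the first demand expression is returned to the user in the form of an inquiry.
In this embodiment, the first demand expression can be returned to the user as a yes/no question, so that the user only needs to give a simple spoken answer such as "yes/no", "I do/I don't", "OK/not OK", or "right/wrong".
Further, in order to tell the user clearly why demand analysis failed this time, thereby reducing the user's anxiety and confusion and improving the user experience, the embodiments of the present application can analyze the reason for the failure of demand analysis and further carry that reason in the inquiry.
Analyzing the reason for the failure of demand analysis can include, but is not limited to, one or any combination of the following kinds of processing:
First processing: during speech recognition, noise detection is performed on the background in which the user input the first voice instruction. Severe noise already has an impact at the speech recognition stage, which causes the subsequent demand analysis to fail.
Second processing: pronunciation detection is performed during speech recognition to check whether the user's pronunciation is accurate. Inaccurate pronunciation likewise has an impact at the speech recognition stage, which causes the subsequent demand analysis to fail.
Third processing: text-length detection is performed on the speech recognition result of the first voice instruction. Overly long sentences usually have an adverse effect on demand analysis. For example, it is difficult to work out the semantics of an overly long sentence during semantic analysis, which causes demand analysis to fail.
Fourth processing: colloquialism detection is performed on the speech recognition result of the first voice instruction. Sentences that are too colloquial adversely affect semantic analysis, which causes demand analysis to fail.
Other kinds of processing may also exist and are not enumerated one by one here.
Corresponding to the above kinds of processing, the resulting reasons for the failure of demand analysis can include, for example: a noisy environment, inaccurate pronunciation of the first voice instruction, the length of the first voice instruction exceeding the limit, the first voice instruction being too colloquial, the first voice instruction being too broad, and so on.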
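The four kinds of processing can be pictured as independent checks whose results are collected into a list of reasons. Everything concrete in this sketch is assumed: the `RecognitionInfo` fields, the thresholds, and the crude filler-phrase count used as a colloquialism detector. The application names the checks but gives no detection method or threshold.

```python
from dataclasses import dataclass


@dataclass
class RecognitionInfo:
    """Hypothetical signals produced alongside the speech recognition text."""
    text: str
    noise_db: float              # estimated background noise level
    pronunciation_score: float   # 0.0 (poor) to 1.0 (clear)


# Assumed filler phrases used as a rough colloquialism signal.
FILLERS = ("please", "can you", "you know")


def failure_reasons(info, max_noise_db=60.0, min_pronunciation=0.6,
                    max_words=12, max_fillers=1):
    """Collect candidate reasons for a demand analysis failure."""
    reasons = []
    if info.noise_db > max_noise_db:                      # first processing
        reasons.append("noisy environment")
    if info.pronunciation_score < min_pronunciation:      # second processing
        reasons.append("inaccurate pronunciation")
    if len(info.text.split()) > max_words:                # third processing
        reasons.append("instruction too long")
    lowered = info.text.lower()
    if sum(lowered.count(f) for f in FILLERS) > max_fillers:  # fourth
        reasons.append("instruction too colloquial")
    return reasons
```

The returned reasons could then be worded into the inquiry, as in the "The sentence you entered is rather broad" preamble of the example dialogue.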
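In addition, a preset template can be used to form the yes/no question in the inquiry, for example "Do you need me to look for ...", "Shall I look for ... for you", or "Do you want to ...".
Continuing the above example, for the first voice instruction input by the user, "Can you help me find the nearest place where I can breastfeed my baby? The baby is hungry, please", the voice assistant can return: "The sentence you entered is rather broad, and XX did not hear it clearly. Do you want to find the nearest shopping center with a mother-and-baby room?"
In 305, the user's voice instruction regarding the first demand expression is received. If a second voice instruction in which the user confirms the first demand expression is received, 306 is executed; if a third voice instruction in which the user rejects the first demand expression is received, 307 is executed.
In this embodiment, the user only needs to give a confirming or rejecting reply regarding the first demand expression.
In 306, a service response is performed using the demand analysis result corresponding to the first demand expression, and this voice interaction flow ends.
Continuing the above example, since the first demand expression is the one with the highest confidence, the user will very probably confirm "The sentence you entered is rather broad, and XX did not hear it clearly. Do you want to find the nearest shopping center with a mother-and-baby room?". If the second voice instruction replied by the user confirms the first demand expression, the demand analysis result for the first demand expression is:
"Intent: information retrieval
Origin: current location
Search term: shopping center
Filter conditions: has a mother-and-baby room, nearest"
After retrieval, the nearest shopping center with a mother-and-baby room is returned to the user.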
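In 307, the demand expression with the second-highest confidence mapped by the demand prediction model is taken as the second demand expression, and the second demand expression is returned to the user in the form of an inquiry.
Continuing the above example, if the user answers "The sentence you entered is rather broad, and XX did not hear it clearly. Do you want to find the nearest shopping center with a mother-and-baby room?" with the third voice instruction "No", then "Do you want to find the nearest postpartum care center?" can be returned to the user.
In 308, the user's voice instruction regarding the second demand expression is received. If a second voice instruction in which the user confirms the second demand expression is received, 309 is executed; if a third voice instruction in which the user rejects the second demand expression is received, 310 is executed.
In 309, a service response is performed using the demand analysis result corresponding to the second demand expression, and this voice interaction flow ends.
In 310, the preset maximum number of interaction rounds has been reached (assuming the preset maximum is 2 rounds), and a demand-understanding failure result is returned to the user.
If the user still rejects the second demand expression and the preset maximum number of interaction rounds is 2, the maximum has been reached, and a demand-understanding failure result can be returned to the user, for example "I didn't understand your demand". The user can also be prompted to re-enter the first voice instruction, for example "I didn't understand your demand, please try a simpler way of saying it".
This embodiment uses 2 rounds as the maximum number of interaction rounds. If more rounds are allowed, the demand expression with the next-highest confidence mapped by the demand prediction model can continue to be returned to the user in the form of an inquiry, until the user confirms or the preset maximum number of interaction rounds is reached.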
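Steps 303 to 310 reduce to a bounded loop over the candidates in descending order of confidence. The sketch below is illustrative only: `confirm_one_by_one`, the English wording of the template, and modeling the user's reply as a callable that returns True or False are assumptions.

```python
TEMPLATE = "find the nearest {}"   # assumed English form of the preset template


def confirm_one_by_one(candidates, answer, max_rounds=2):
    """Ask about one demand expression per round, highest confidence first.

    candidates: list of (expression, confidence) from the prediction model.
    answer: callable taking the question text and returning True (confirm)
            or False (reject); it stands in for the user's spoken reply.
    Returns the confirmed demand expression, or None when the maximum
    number of rounds is reached without confirmation.
    """
    ranked = sorted(candidates, key=lambda item: item[1], reverse=True)
    for expression, _confidence in ranked[:max_rounds]:
        demand = TEMPLATE.format(expression)
        if answer(f"Do you want to {demand}?"):
            return demand          # steps 306 / 309: go on to service response
    return None                    # step 310: report understanding failure
```

Raising `max_rounds` gives the longer variant described in the last paragraph above, at the cost of more turns before a failure is reported.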
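For comparison with the traditional scenarios, the scenario corresponding to this embodiment is in most cases as follows:
User: Can you help me find the nearest place where I can breastfeed my baby? The baby is hungry, please.
Voice assistant: The sentence you entered is rather broad, and XX did not hear it clearly. Do you want to find the nearest shopping center with a mother-and-baby room?
User: Yes.
The voice assistant returns information about the nearest shopping center with a mother-and-baby room to the user through a display, by voice, or in another form.
Clearly, compared with the traditional approach, this approach greatly improves the user's interaction efficiency and experience.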
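Embodiment 3
Fig. 4 is a flowchart of the voice interaction method provided in Embodiment 3 of the present application. As shown in Fig. 4, the method may include the following steps:
In 401, speech recognition and demand analysis are performed on the first voice instruction input by the user. If the demand analysis succeeds, 402 is executed; if the demand analysis fails, 403 is executed.
In 402, a service response is performed using the demand analysis result corresponding to the first voice instruction, and this voice interaction flow ends.
In 403, the first voice instruction is input into a pre-trained demand prediction model, the demand prediction model maps the first voice instruction to at least one demand expression, and the N demand expressions with the highest confidence are returned to the user in the form of an inquiry, N being a preset positive integer.
Embodiment 3 differs from Embodiment 2 in that the demand expressions mapped by the demand prediction model are not returned to the user one by one in each round of interaction. Instead, N of the demand expressions are returned to the user together as an alternative question for the user to choose from.
Further, in order to tell the user clearly why demand analysis failed this time, thereby reducing the user's anxiety and confusion and improving the user experience, the embodiments of the present application can also analyze the reason for the failure of demand analysis and further carry that reason in the inquiry. This part is similar to Embodiment 2 and is not repeated here.
Taking N = 2 as an example, for the first voice instruction input by the user, "Can you help me find the nearest place where I can breastfeed my baby? The baby is hungry, please", the voice assistant can return: "The sentence you entered is rather broad, and XX did not hear it clearly. Do you want to find the nearest shopping center with a mother-and-baby room or find the nearest postpartum care center?" The user then only needs to reply "the former" or "the latter".
In 404, the user's voice instruction is received. If a voice instruction in which the user confirms one of the demand expressions is received, 405 is executed; if a voice instruction in which the user rejects all of the demand expressions is received, 406 is executed.
In 405, a service response is performed using the demand analysis result corresponding to the demand expression confirmed by the user, and this voice interaction flow ends.
In 406, a demand-understanding failure result is returned to the user.
If the user does not confirm any of the demand expressions, a demand-understanding failure result can be returned to the user, for example "I didn't understand your demand". The user can also be prompted to re-enter the first voice instruction, for example "I didn't understand your demand, please try a simpler way of saying it".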
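The top-N variant can be sketched as two small helpers: one builds the alternative question, the other resolves the reply. The names `build_choice_question` and `resolve_choice`, the English templates, and the fixed table of accepted replies are assumptions; with N greater than 2, "the latter" would be ambiguous and a real system would need richer reply handling.

```python
# Assumed mapping from short replies to option positions (suits N = 2).
ORDINAL_REPLIES = {"the former": 0, "the first": 0,
                   "the latter": 1, "the second": 1}


def build_choice_question(candidates, n=2):
    """Return (options, question) for the n most confident expressions."""
    ranked = sorted(candidates, key=lambda item: item[1], reverse=True)
    options = [f"find the nearest {expr}" for expr, _conf in ranked[:n]]
    question = "Do you want to " + " or ".join(options) + "?"
    return options, question


def resolve_choice(reply, options):
    """Map the user's reply to one option, or None if nothing is chosen."""
    normalized = reply.strip().lower()
    index = ORDINAL_REPLIES.get(normalized)
    if index is not None and index < len(options):
        return options[index]
    for option in options:             # the user may repeat an option verbatim
        if option.lower() in normalized:
            return option
    return None                        # step 406: nothing was confirmed
```

A single alternative question resolves the demand in one turn when the right option is among the top N, whereas the one-by-one approach of Embodiment 2 may need several turns.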
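For comparison with the traditional scenarios, the scenario corresponding to this embodiment is in most cases as follows:
User: Can you help me find the nearest place where I can breastfeed my baby? The baby is hungry, please.
Voice assistant: The sentence you entered is rather broad, and XX did not hear it clearly. Do you want to find the nearest shopping center with a mother-and-baby room or find the nearest postpartum care center?
User: The former.
The voice assistant returns information about the nearest shopping center with a mother-and-baby room to the user through a display, by voice, or in another form.
Clearly, compared with the traditional approach, this approach greatly improves the user's interaction efficiency and experience.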
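The above is a detailed description of the method provided by the present application. The apparatus provided by the present application is described in detail below in conjunction with an embodiment.
Embodiment 4
Fig. 5 is a structural diagram of the voice interaction apparatus provided in Embodiment 4 of the present application. As shown in Fig. 5, the apparatus may include a voice interaction unit 01, a voice processing unit 02, a demand prediction unit 03, and a service response unit 04, and may further include a model training unit 05 and a reason analysis unit 06. The main functions of the units are as follows:
The voice interaction unit 01 is responsible for receiving and passing on data from the user and data returned to the user. It first receives the first voice instruction input by the user.
The voice processing unit 02 is responsible for performing speech recognition and demand analysis on the first voice instruction. For the first voice instruction input by the user, speech recognition is performed first. After the text produced by speech recognition is obtained, demand analysis is performed. The purpose of the demand analysis is mainly to understand the user's specific demand (also called the intent) and to obtain structured information, so that accurate services can be provided to the user. The specific demand analysis methods and analysis results can depend on the specific vertical service. This unit can use existing technology, which is not described further here.
If the demand analysis succeeds, the service response unit 04 performs a service response using the demand analysis result corresponding to the first voice instruction.
If the demand analysis fails, the demand prediction unit 03 performs demand prediction on the first voice instruction to obtain at least one demand expression. The voice interaction unit 01 then returns at least one of the demand expressions to the user in the form of an inquiry.
If the voice interaction unit 01 receives a second voice instruction in which the user confirms at least one of the demand expressions, the service response unit 04 performs a service response using the demand analysis result corresponding to the demand expression confirmed by the user.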
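Specifically, the demand prediction unit 03 can input the first voice instruction into a pre-trained demand prediction model, and the demand prediction model maps the first voice instruction to at least one demand expression. Each of the demand expressions mapped by the demand prediction model has a confidence, which represents how accurately it predicts the demand corresponding to the first voice instruction.
The model training unit 05 is responsible for training the demand prediction model. Specifically, the model training unit 05 acquires training data, which includes a plurality of sentence pairs, each including a first sentence and a second sentence, where the second sentence can be successfully analyzed for demand. It trains a Seq2Seq model using the training data to obtain the demand prediction model, where the first sentence of a sentence pair serves as the input of the Seq2Seq model and the second sentence serves as the target output of the Seq2Seq model.
As a preferred implementation, the training data can be obtained from text search logs. Specifically, a query in the text search log can serve as the first sentence, and the second sentence is obtained from the clicked search result corresponding to the query. The first sentence and the second sentence form a sentence pair, and the confidence of the second sentence is determined by the number of times the second sentence was clicked when the first sentence was the query. The more clicks, the higher the corresponding confidence.
When returning at least one of the demand expressions to the user in the form of an inquiry, the voice interaction unit can adopt, but is not limited to, the following two approaches:
First approach: the demand expression with the highest confidence among the at least one demand expression mapped by the demand prediction model is taken as the first demand expression, and the first demand expression is returned to the user in the form of an inquiry.
If a third voice instruction in which the user rejects the first demand expression is received, the demand expression with the second-highest confidence among the at least one demand expression mapped by the demand prediction model is taken as the second demand expression, and the second demand expression is returned to the user in the form of an inquiry.
The inquiry in this approach can take the form of a yes/no question, so that the user only needs to give a simple spoken answer such as "yes/no", "I do/I don't", "OK/not OK", or "right/wrong".
In addition, the maximum number of interaction rounds can be limited in this approach. When the maximum number of interaction rounds is reached, the voice interaction unit 01 can return a demand-understanding failure result to the user, or prompt the user to re-enter the first voice instruction.
Second approach: the N demand expressions with the highest confidence among the at least one demand expression mapped by the demand prediction model are returned to the user in the form of an inquiry, N being a preset positive integer.
In this approach, the N demand expressions can be returned to the user as an alternative question for the user to choose from.
If the user does not confirm any of the demand expressions, the voice interaction unit 01 can return a demand-understanding failure result to the user, or prompt the user to re-enter the first voice instruction.
Furthermore, the reason analysis unit 06 can analyze the reason for the failure of demand analysis, and that reason is further carried in the inquiry.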
分析需求解析失败的原因可以包括但不限于以下处理中的一种或任意组合:
第一种处理:在语音识别时,对用户输入第一语音指令的背景进行噪声检测,对于噪声严重的情况,在语音识别阶段就会产生影响,从而导致后续需求解析失败。
第二种处理:在语音识别过程中进行发音的检测,以检测用户是否发音准确。对于发音不准确的情况,同样在语音识别阶段就会产生影响,从而导致后续需求解析失败。
第三种处理:对第一语音指令的语音识别结果进行文本长度的检测。对于过长的语句,通常会对需求解析产生不利影响,例如在进行语义分析时很难分析出过长语句的语义,从而导致需求解析失败。
第四种处理:对第一语音指令的语音识别结果进行口语化检测。对于表达过于口语化的语句,会对语义分析产生不利影响,从而导致需求解析失败。
还可能存在其他处理方式,在此不做一一穷举。
对应于上述几种处理方式,得到的需求解析失败的原因可以包括:诸如:环境嘈杂、第一语音指令的发音不准确、第一语音指令的长度超限、第一语音指令过于口语化,等等。
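The four kinds of processing can be combined into one reason-analysis routine, sketched below. The source does not say how noise, pronunciation, length or colloquialism are measured, so every signal and threshold here is a hypothetical placeholder: `snr_db` and `pronunciation_score` are assumed to be supplied by the recognizer, and the filler-word list is a crude stand-in for colloquialism detection.

```python
import re
from typing import List, Optional, Sequence


def analyze_failure(
    text: str,
    snr_db: Optional[float] = None,
    pronunciation_score: Optional[float] = None,
    *,
    min_snr_db: float = 10.0,          # assumed threshold
    min_pronunciation: float = 0.6,    # assumed threshold
    max_chars: int = 60,               # assumed length limit
    fillers: Sequence[str] = ("um", "uh", "like", "you know"),
) -> List[str]:
    """Return the likely reasons why demand analysis of `text` failed."""
    reasons: List[str] = []
    if snr_db is not None and snr_db < min_snr_db:
        reasons.append("noisy environment")
    if pronunciation_score is not None and pronunciation_score < min_pronunciation:
        reasons.append("inaccurate pronunciation")
    if len(text) > max_chars:
        reasons.append("instruction too long")
    # Normalize to space-separated lowercase words so fillers match whole words only.
    words = " " + " ".join(re.findall(r"[a-z']+", text.lower())) + " "
    if any(f" {filler} " in words for filler in fillers):
        reasons.append("instruction too colloquial")
    return reasons
```

The returned reasons could then be prepended to the inquiry, in the way the dialogue earlier in this document tells the user that the sentence entered was rather broad.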
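According to embodiments of the present application, the present application further provides an electronic device and a readable storage medium.
FIG. 6 is a block diagram of an electronic device for the voice interaction method according to an embodiment of the present application. The electronic device is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers and other suitable computers. The electronic device may also represent various forms of mobile apparatuses, such as personal digital assistants, cellular phones, smartphones, wearable devices and other similar computing apparatuses. The components shown herein, their connections and relationships, and their functions are examples only and are not intended to limit the implementation of the present application described and/or claimed herein.
As shown in FIG. 6, the electronic device includes one or more processors 601, a memory 602, and interfaces for connecting the components, including high-speed interfaces and low-speed interfaces. The components are interconnected by different buses and may be mounted on a common motherboard or in other manners as required. The processor may process instructions executed within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to an interface). In other implementations, multiple processors and/or multiple buses may be used together with multiple memories if required. Likewise, multiple electronic devices may be connected, each providing part of the necessary operations (for example, as a server array, a group of blade servers, or a multi-processor system). One processor 601 is taken as an example in FIG. 6.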
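The memory 602 is the non-transitory computer-readable storage medium provided by the present application. The memory stores instructions executable by at least one processor, so that the at least one processor performs the voice interaction method provided by the present application. The non-transitory computer-readable storage medium of the present application stores computer instructions for causing a computer to perform the voice interaction method provided by the present application.
As a non-transitory computer-readable storage medium, the memory 602 may be used to store non-transitory software programs, non-transitory computer-executable programs and modules, such as the program instructions/modules corresponding to the voice interaction method in the embodiments of the present application. By running the non-transitory software programs, instructions and modules stored in the memory 602, the processor 601 executes the various functional applications and data processing of the server, that is, implements the voice interaction method in the above method embodiments.
The memory 602 may include a program storage area and a data storage area. The program storage area may store an operating system and an application program required by at least one function; the data storage area may store data created through use of the electronic device, and the like. In addition, the memory 602 may include high-speed random access memory and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory 602 may optionally include memories located remotely from the processor 601, and these remote memories may be connected to the electronic device through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks and combinations thereof.
The electronic device for the voice interaction method may further include an input apparatus 603 and an output apparatus 604. The processor 601, the memory 602, the input apparatus 603 and the output apparatus 604 may be connected by a bus or in other manners; connection by a bus is taken as an example in FIG. 6.
The input apparatus 603 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device; examples include a touch screen, a keypad, a mouse, a trackpad, a touchpad, a pointing stick, one or more mouse buttons, a trackball and a joystick. The output apparatus 604 may include a display device, an auxiliary lighting apparatus (for example, an LED), a haptic feedback apparatus (for example, a vibration motor), and the like. The display device may include, but is not limited to, a liquid crystal display (LCD), a light-emitting diode (LED) display and a plasma display. In some implementations, the display device may be a touch screen.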
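Various implementations of the systems and techniques described here may be realized in digital electronic circuitry, integrated circuitry, application-specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor. The programmable processor may be special-purpose or general-purpose, and may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input apparatus and at least one output apparatus.
These computer programs (also called programs, software, software applications or code) include machine instructions for a programmable processor, and may be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, device and/or apparatus (for example, magnetic disks, optical disks, memory, programmable logic devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide interaction with a user, the systems and techniques described here may be implemented on a computer having a display apparatus (for example, a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user, and a keyboard and a pointing apparatus (for example, a mouse or a trackball) through which the user can provide input to the computer. Other kinds of apparatuses may also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (for example, visual feedback, auditory feedback or tactile feedback), and input from the user may be received in any form (including acoustic input, voice input or tactile input).
The systems and techniques described here may be implemented in a computing system that includes a back-end component (for example, as a data server), or a computing system that includes a middleware component (for example, an application server), or a computing system that includes a front-end component (for example, a user computer having a graphical user interface or a web browser through which the user can interact with implementations of the systems and techniques described here), or a computing system that includes any combination of such back-end, middleware or front-end components. The components of the system may be interconnected by digital data communication in any form or medium (for example, a communication network). Examples of communication networks include a local area network (LAN), a wide area network (WAN) and the Internet.
A computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The client-server relationship arises from computer programs that run on the respective computers and have a client-server relationship with each other.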
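As can be seen from the above description, the method, apparatus, device and computer storage medium provided by the present application may have the following advantages:
1) After demand analysis of the voice instruction entered by the user fails, the present application further performs demand prediction on the voice instruction, "guessing" the demand expressions the user may have intended and returning them to the user for confirmation, instead of bluntly telling the user that what was said cannot be understood. This improves the user's interaction efficiency and the user experience.
2) For the demand expressions that the voice assistant returns in the form of an inquiry, the user only needs to confirm or choose in order to enter a valid instruction, without having to rephrase and re-enter the voice instruction. This further improves the user's interaction efficiency and the user experience.
3) After demand understanding fails, the reason for the failure is analyzed while the user's demand is being predicted, and the reason obtained is returned to the user. This reduces the user's confusion and anxiety and further improves the user experience.
It should be understood that steps may be reordered, added or deleted using the various forms of flows shown above. For example, the steps described in the present application may be performed in parallel, sequentially or in a different order, as long as the desired result of the technical solution disclosed in the present application can be achieved; no limitation is imposed herein.
The specific implementations above do not limit the scope of protection of the present application. Those skilled in the art should understand that various modifications, combinations, sub-combinations and substitutions may be made according to design requirements and other factors. Any modification, equivalent replacement or improvement made within the spirit and principles of the present application shall fall within the scope of protection of the present application.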

Claims (19)

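  1. A voice interaction method, characterized in that the method comprises:
    performing speech recognition and demand analysis on a first voice instruction entered by a user;
    if the demand analysis fails, performing demand prediction on the first voice instruction to obtain at least one demand expression;
    returning at least one of the demand expressions to the user in the form of an inquiry;
    if a second voice instruction in which the user confirms at least one of the demand expressions is received, performing a service response using the demand analysis result corresponding to the demand expression confirmed by the user.
  2. The method according to claim 1, characterized in that the method further comprises:
    if the demand analysis succeeds, performing a service response using the demand analysis result corresponding to the first voice instruction.
  3. The method according to claim 1, characterized in that performing demand prediction on the first voice instruction to obtain at least one demand expression comprises:
    inputting the first voice instruction into a pre-trained demand prediction model, the demand prediction model mapping the first voice instruction to at least one demand expression.
  4. The method according to claim 3, characterized in that the demand prediction model is pre-trained in the following manner:
    obtaining training data, the training data comprising a plurality of sentence pairs, each sentence pair comprising a first sentence and a second sentence, wherein the second sentence can be successfully analyzed for demand;
    training a sequence-to-sequence (Seq2Seq) model using the training data to obtain the demand prediction model, wherein the first sentence of a sentence pair serves as the input of the Seq2Seq model and the second sentence serves as the target output of the Seq2Seq model.
  5. The method according to claim 4, characterized in that the training data is obtained from a text search log;
    wherein a text search request (query) is taken as the first sentence, the second sentence is obtained from the clicked search results corresponding to the query, the first sentence and the second sentence form a sentence pair, and the confidence of the second sentence is determined by the number of times the second sentence was clicked when the first sentence was used as the query.
  6. The method according to claim 3, characterized in that returning at least one of the demand expressions to the user in the form of an inquiry comprises:
    taking the demand expression with the highest confidence among the at least one demand expression mapped by the demand prediction model as a first demand expression;
    returning the first demand expression to the user in the form of an inquiry.
  7. The method according to claim 6, characterized in that returning at least one of the demand expressions to the user in the form of an inquiry further comprises:
    if a third voice instruction in which the user rejects the first demand expression is received, taking the demand expression with the second-highest confidence among the at least one demand expression mapped by the demand prediction model as a second demand expression;
    returning the second demand expression to the user in the form of an inquiry.
  8. The method according to claim 3, characterized in that returning at least one of the demand expressions to the user in the form of an inquiry comprises:
    returning, in the form of an inquiry, the demand expressions whose confidence ranks in the top N among the at least one demand expression mapped by the demand prediction model to the user, N being a preset positive integer.
  9. The method according to claim 1, 6 or 8, characterized in that the method further comprises:
    analyzing the reason for the failure of the demand analysis, and further carrying the reason for the failure of the demand analysis in the inquiry.
  10. The method according to claim 9, characterized in that the reason for the failure of the demand analysis comprises:
    a noisy environment, the length of the first voice instruction exceeding a limit, inaccurate pronunciation of the first voice instruction, or the first voice instruction being colloquial.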
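  11. A voice interaction apparatus, characterized in that the apparatus comprises:
    a voice interaction unit, configured to receive a first voice instruction entered by a user;
    a voice processing unit, configured to perform speech recognition and demand analysis on the first voice instruction;
    a demand prediction unit, configured to, if the demand analysis fails, perform demand prediction on the first voice instruction to obtain at least one demand expression;
    the voice interaction unit being further configured to return at least one of the demand expressions to the user in the form of an inquiry;
    a service response unit, configured to, if the voice interaction unit receives a second voice instruction in which the user confirms at least one of the demand expressions, perform a service response using the demand analysis result corresponding to the demand expression confirmed by the user.
  12. The apparatus according to claim 11, characterized in that the service response unit is further configured to, if the demand analysis succeeds, perform a service response using the demand analysis result corresponding to the first voice instruction.
  13. The apparatus according to claim 11, characterized in that the demand prediction unit is specifically configured to input the first voice instruction into a pre-trained demand prediction model, the demand prediction model mapping the first voice instruction to at least one demand expression.
  14. The apparatus according to claim 13, characterized in that the apparatus further comprises:
    a model training unit, configured to obtain training data, the training data comprising a plurality of sentence pairs, each sentence pair comprising a first sentence and a second sentence, wherein the second sentence can be successfully analyzed for demand; and to train a Seq2Seq model using the training data to obtain the demand prediction model, wherein the first sentence of a sentence pair serves as the input of the Seq2Seq model and the second sentence serves as the target output of the Seq2Seq model.
  15. The apparatus according to claim 13, characterized in that, when returning at least one of the demand expressions to the user in the form of an inquiry, the voice interaction unit specifically performs:
    taking the demand expression with the highest confidence among the at least one demand expression mapped by the demand prediction model as a first demand expression;
    returning the first demand expression to the user in the form of an inquiry.
  16. The apparatus according to claim 15, characterized in that, when returning at least one of the demand expressions to the user in the form of an inquiry, the voice interaction unit is further configured to:
    if a third voice instruction in which the user rejects the first demand expression is received, take the demand expression with the second-highest confidence among the at least one demand expression mapped by the demand prediction model as a second demand expression;
    return the second demand expression to the user in the form of an inquiry.
  17. The apparatus according to claim 13, characterized in that, when returning at least one of the demand expressions to the user in the form of an inquiry, the voice interaction unit specifically performs:
    returning, in the form of an inquiry, the demand expressions whose confidence ranks in the top N among the at least one demand expression mapped by the demand prediction model to the user, N being a preset positive integer.
    The apparatus according to claim 11, 15 or 17, characterized in that the apparatus further comprises:
    a reason analysis unit, configured to analyze the reason for the failure of the demand analysis and to further carry the reason for the failure of the demand analysis in the inquiry.
  18. An electronic device, characterized by comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor; wherein
    the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method according to any one of claims 1-10.
  19. A non-transitory computer-readable storage medium storing computer instructions, characterized in that the computer instructions are used to cause the computer to perform the method according to any one of claims 1-10.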
PCT/CN2020/116018 2020-02-18 2020-09-17 Voice interaction method, apparatus, device and computer storage medium Ceased WO2021164244A1 (zh)

Priority Applications (4)

Application Number Priority Date Filing Date Title
KR1020217032708A KR20210137531A (ko) 2020-02-18 2020-09-17 Voice interaction method, apparatus, device and computer recording medium
EP20864285.0A EP3896690B1 (en) 2020-02-18 2020-09-17 Voice interaction method and apparatus, device and computer storage medium
US17/279,540 US11978447B2 (en) 2020-02-18 2020-09-17 Speech interaction method, apparatus, device and computer storage medium
JP2021571465A JP2022531987A (ja) 2020-02-18 2020-09-17 Voice interaction method, apparatus, device, and computer storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010099574.4A CN111341309A (zh) 2020-02-18 2020-02-18 Voice interaction method, apparatus, device and computer storage medium
CN202010099574.4 2020-02-18

Publications (1)

Publication Number Publication Date
WO2021164244A1 true WO2021164244A1 (zh) 2021-08-26

Family

ID=71183485

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/116018 Ceased WO2021164244A1 (zh) 2020-02-18 2020-09-17 Voice interaction method, apparatus, device and computer storage medium

Country Status (6)

Country Link
US (1) US11978447B2 (zh)
EP (1) EP3896690B1 (zh)
JP (1) JP2022531987A (zh)
KR (1) KR20210137531A (zh)
CN (1) CN111341309A (zh)
WO (1) WO2021164244A1 (zh)

Also Published As

Publication number Publication date
KR20210137531A (ko) 2021-11-17
CN111341309A (zh) 2020-06-26
JP2022531987A (ja) 2022-07-12
US11978447B2 (en) 2024-05-07
EP3896690B1 (en) 2023-03-15
US20220351721A1 (en) 2022-11-03
EP3896690A1 (en) 2021-10-20
EP3896690A4 (en) 2021-12-01

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2020864285

Country of ref document: EP

Effective date: 20210324

ENP Entry into the national phase

Ref document number: 20217032708

Country of ref document: KR

Kind code of ref document: A

ENP Entry into the national phase

Ref document number: 2021571465

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE