WO2021153201A1 - Information processing device and information processing method - Google Patents
Information processing device and information processing method
- Publication number
- WO2021153201A1 (application PCT/JP2021/000600)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- user
- information
- intake
- utterance
- unit
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/04—Segmentation; Word boundary detection
- G10L15/05—Word boundary detection
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/226—Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
- G10L25/87—Detection of discrete points within a voice signal
Definitions
- This disclosure relates to an information processing device and an information processing method.
- Conventionally, a technique for determining the utterance timing in a voice dialogue system is known (for example, Patent Document 1).
- In this technique, the utterance timing of the voice dialogue system is determined based on the timing at which the user's breathing changes from exhalation to inspiration.
- An information processing device according to one form of the present disclosure includes an acquisition unit that acquires inspiration information indicating a user's inspiration, and a prediction unit that predicts, based on the inspiration information acquired by the acquisition unit, whether or not the user speaks after the user's inspiration.
- 1. Embodiment
- 1-1. Outline of information processing according to the embodiment of the present disclosure
- 1-1-1. Background and effects
- 1-1-2. Sensor example
- 1-1-2-1. Contact type
- 1-1-2-2. Non-contact type
- 1-2. Configuration of the information processing system according to the embodiment
- 1-3. Configuration of the information processing device according to the embodiment
- 1-4. Configuration of the terminal device according to the embodiment
- 1-5. Information processing procedure according to the embodiment
- 1-5-1. Procedure for processing related to the information processing device
- 1-5-2. Procedure for processing related to the information processing system
- 1-6. Processing example using classification results
- 1-6-1. Example of omitting the activation word according to the respiratory state
- 1-6-2. Example of switching between local / cloud voice recognition
- 1-6-3. Example of changing the voice recognition dictionary
- 1-6-4. Example of changing the UI selected according to the intake state
- 1-6-5. System response change example
- 2. Other embodiments
- 2-1. Configuration example in which prediction processing is performed on the client side
- 2-2. Other configuration examples
- 2-3. Others
- 3. Effect of this disclosure
- 4. Hardware configuration
- FIG. 1 is a diagram showing an example of information processing according to the embodiment of the present disclosure.
- the information processing according to the embodiment of the present disclosure is realized by the information processing system 1 (see FIG. 5) including the server device 100 (see FIG. 6) and the terminal device 10 (see FIG. 8).
- the server device 100 is an information processing device that executes information processing according to the embodiment.
- the server device 100 predicts whether or not the user speaks after the user's intake based on the intake information indicating the user's intake.
- In the example of FIG. 1, the sensor information detected by the respiration sensor 171 (see FIG. 8) of the terminal device 10 used by the user is used as the inspiration information.
- In the example of FIG. 1, the inspiration information indicating the user's inspiration is detected by the respiration sensor 171 using a millimeter-wave radar, but the sensor is not limited to a millimeter-wave radar; any sensor may be used as long as it can detect the user's inspiration information. This point will be described later.
- FIG. 1 will now be described in detail.
- The following describes an example in which the server device 100 performs prediction processing that predicts whether or not the user U1 speaks after the intake of the user U1, using the intake information indicating that intake detected by the terminal device 10.
- Although FIG. 1 shows the case where the server device 100 performs the prediction process (information processing), the terminal device 10 may perform the prediction process (information processing) instead. This point will be described later.
- The server device 100 acquires the intake information BINF1 indicating the intake of the user U1 from the terminal device 10 used by the user U1.
- the terminal device 10 is a smart speaker.
- the terminal device 10 is not limited to the smart speaker and may be any device such as a smartphone, and details of this point will be described later.
- the server device 100 performs prediction processing using the intake information BINF1 indicating the intake of the user U1 (step S1).
- the server device 100 calculates the score using the intake information BINF1.
- the server device 100 predicts whether or not the user U1 speaks after the inspiration corresponding to the inspiration information BINF1 by comparing the calculated score with the threshold value.
- the server device 100 predicts that the user U1 speaks after the inspiration corresponding to the inspiration information BINF1.
- FIG. 2 is a diagram showing an example of user's intake information.
- FIG. 3 is a diagram showing an example of prediction using the user's intake air.
- the graph GR1 in FIG. 2 is a graph showing the relationship between time and the amount of intake air, and the horizontal axis shows time and the vertical axis shows the amount of intake air.
- The range between the line LN1 and the line LN2 in the graph GR1 indicates the normal respiration range of the user U1.
- The respiration volume corresponding to the line LN1 indicates the lower limit of the inspiratory volume during normal respiration.
- The respiration volume corresponding to the line LN2 indicates the upper limit of the inspiratory volume during normal respiration.
- During normal respiration, breathing is repeated with the inspiratory amount staying within the normal respiration range between the line LN1 and the line LN2.
- The current intake amount "B_curent", which is the intake amount at the current value CR1 in the graph GR1, indicates the latest intake amount at the time the intake information BINF1 is detected (the current time). The increase amount "B_increase" in the graph GR1 indicates the change (increase) in the inspiratory amount at the current value CR1.
- The intake information BINF1 includes the current intake amount "B_curent", which is the intake amount at the current value CR1 in FIG. 2, and the increase amount "B_increase".
- the intake information BINF1 may include a transition of the intake amount between the intake start time IS1 immediately before the current value CR1 in FIG. 2 and the current value CR1.
- the server device 100 may calculate the increase amount "B_increase” from the transition of the intake air amount.
- The increase amount "B_increase" may be information indicating the ratio of the increase in the inspiratory amount to the passage of time, that is, a slope (rate of change).
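The derivation of "B_increase" from the transition of the intake amount can be sketched as follows. This is only an illustration: the source does not specify how the slope is computed, so the function name `increase_rate`, the (time, amount) sample format, and the choice of an average slope between the intake start time IS1 and the current value CR1 are all assumptions.

```python
from typing import Sequence, Tuple


def increase_rate(samples: Sequence[Tuple[float, float]]) -> float:
    """Average rate of change of the intake amount over the given samples.

    samples: (time_seconds, intake_amount) pairs ordered by time, running from
    the intake start time (IS1) to the current value (CR1). Hypothetical format.
    """
    if len(samples) < 2:
        raise ValueError("at least two samples are required")
    (t_start, b_start), (t_now, b_now) = samples[0], samples[-1]
    if t_now <= t_start:
        raise ValueError("samples must be ordered by increasing time")
    # Slope between the first and last sample: increase per unit time.
    return (b_now - b_start) / (t_now - t_start)


# Example: the intake amount rises from 1.0 to 2.0 in half a second.
b_increase = increase_rate([(0.0, 1.0), (0.25, 1.4), (0.5, 2.0)])
```

A least-squares fit over all samples would be less sensitive to sensor noise than this two-point slope; either reading is compatible with the description of "B_increase" as a rate of change.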
- the server device 100 calculates the utterance prediction score "Score_uttr_pr”, which is the score used for utterance prediction, using the current intake amount "B_curent”, the increase amount “B_increase”, and the following equation (1).
- The server device 100 predicts whether or not the user U1 speaks by comparing the utterance prediction score "Score_uttr_pr" with the utterance presence / absence prediction threshold value "Threshold_uttr_pr", which is the threshold value used for predicting the presence or absence of an utterance. That is, the server device 100 classifies the utterance type according to the value of the utterance prediction score "Score_uttr_pr".
- The server device 100 predicts that the user U1 speaks when the utterance prediction score "Score_uttr_pr" is larger than the utterance presence / absence prediction threshold value "Threshold_uttr_pr". In this case, the server device 100 executes the preprocessing necessary for voice recognition, on the assumption that an utterance is highly likely to follow the current inspiration. The server device 100 executes this preprocessing as soon as it predicts that the user U1 will speak after the intake is completed. In the example of FIG. 2, the server device 100 executes the preprocessing required for voice recognition before the user U1 finishes the intake (before the maximum intake amount "B_max" is reached).
- The server device 100 predicts that the user U1 does not speak when the utterance prediction score "Score_uttr_pr" is equal to or less than the utterance presence / absence prediction threshold value "Threshold_uttr_pr". In this case, the server device 100 does not perform advance activation because no utterance is expected after the current intake.
- Alternatively, the server device 100 may predict that the user U1 speaks when the utterance prediction score "Score_uttr_pr" is equal to or higher than the threshold value "Threshold_uttr_pr", and that the user U1 does not speak when the utterance prediction score "Score_uttr_pr" is less than the threshold value "Threshold_uttr_pr".
- In this way, the server device 100 predicts whether or not the user U1 speaks after the inspiration corresponding to the inspiration information BINF1 by comparing the utterance prediction score "Score_uttr_pr" with the utterance presence / absence prediction threshold value "Threshold_uttr_pr". Note that each threshold value, such as "Threshold_uttr_pr", may be increased or decreased as the normal breathing range changes with the exercise state of the user.
- the utterance prediction score "Score_uttr_pr” is a value that takes into account the current respiratory volume and the amount of increase.
- the server device 100 predicts the possibility of subsequent utterances from the current inspiratory volume and increased amount by using the utterance prediction score "Score_uttr_pr". As a result, the server device 100 can determine the possibility of utterance even before reaching the maximum point of intake air, and can be used for system preparation in advance.
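The score-and-threshold comparison described above can be sketched as follows. Equation (1) is not reproduced in this text, so the weighted sum used here is an assumption made only for illustration, as are the coefficient names `a` and `b` and the function names. The variable `b_curent` follows the spelling of the identifier in the source.

```python
def utterance_prediction_score(b_curent: float, b_increase: float,
                               a: float = 1.0, b: float = 1.0) -> float:
    """Score_uttr_pr as a weighted sum of the current intake amount and the
    increase amount. ASSUMPTION: the actual form of equation (1) is not given
    in this text; a and b are hypothetical coefficients."""
    return a * b_curent + b * b_increase


def predicts_utterance(score_uttr_pr: float, threshold_uttr_pr: float) -> bool:
    """True when an utterance is predicted after the current inspiration.

    Uses the strict comparison (score larger than threshold); the text also
    allows the 'equal to or higher' variant.
    """
    return score_uttr_pr > threshold_uttr_pr


score = utterance_prediction_score(b_curent=2.0, b_increase=3.0, a=0.5, b=1.0)
will_speak = predicts_utterance(score, threshold_uttr_pr=3.5)
```

Because the score depends only on values available during the inspiration, the comparison can run before the maximum intake amount is reached, which is what allows the preprocessing to start early.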
- FIG. 4 is a diagram showing an example of the relationship between the user's inspiration and utterance.
- the waveform shown in FIG. 4 shows an example of the transition of the inspiratory volume from the user's normal respiration (steady respiration) to the end of the utterance.
- The inspiratory volume increases before the user's utterance as compared with steady breathing, and decreases as the user speaks.
- the degree of increase in intake is related to the degree of urgency and the like.
- the maximum inspiratory amount is related to the utterance sentence amount, the utterance volume, and the like.
- the degree of decrease in inspiration can be used for predicting the end of utterance.
- the inspiratory amount at the end of the utterance can be used for predicting the possibility of continuing the utterance.
- As described above, the server device 100 can predict (estimate) various information regarding the user's utterance by using the information on the user's inspiration.
- The waveform at time points after the current value CR1, shown by the two-dot chain line in the graph GR1 in FIG. 2, indicates the predicted value of the respiratory volume.
- The server device 100 may predict the respiration volume at time points after the current value CR1 based on the transition of the inspiratory amount up to the current value CR1 and the past respiration history of the user U1. In this case, the server device 100 can predict the maximum intake amount "B_max", which is the maximum intake amount reached by the inspiration corresponding to the current value CR1. As a result, the server device 100 can perform, in advance, the processing described later that uses the maximum respiration amount (maximum intake amount).
- The server device 100 may predict whether or not the user speaks by using the intake amount at the time when the user finishes exhaling and starts inspiration (the initial intake amount) and the increase in the intake amount from that time. In this case, in the example of FIG. 2, the server device 100 predicts whether or not the user speaks by using the intake amount at the intake start time IS1 (the initial intake amount) and the increase from the intake start time IS1. As a result, the server device 100 can predict whether or not the user speaks within a short time after the user starts inhaling.
- The process using the above equation (1) is merely an example; the server device 100 may perform the prediction process by various methods and is not limited to the process described above.
- the server device 100 may perform prediction processing using a technique related to machine learning.
- the server device 100 may perform prediction processing using a model that outputs a score when intake information is input.
- the server device 100 may perform prediction processing using a model that outputs a higher score as the user is more likely to speak after the inspiration corresponding to the inspiration information.
- The server device 100 may train the model using learning data including combinations of inspiration information indicating the user's inspiration and information indicating the presence or absence of an utterance after that inspiration, or may obtain the model from an external information processing device.
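As one possible illustration of such a model, the sketch below trains a small logistic regression on pairs of inspiration features and utterance labels, and outputs a score that is higher when an utterance is more likely. The source does not name a model type; logistic regression, the two-feature input (current intake amount, increase amount), the toy data, and all function names are assumptions.

```python
import math
from typing import List, Sequence, Tuple

Sample = Tuple[Tuple[float, float], int]  # ((intake, increase), spoke: 0 or 1)


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train(data: Sequence[Sample], epochs: int = 500,
          lr: float = 0.5) -> Tuple[List[float], float]:
    """Fit weights and bias by per-sample gradient descent on the log loss."""
    w, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (x1, x2), label in data:
            err = sigmoid(w[0] * x1 + w[1] * x2 + bias) - label
            w[0] -= lr * err * x1
            w[1] -= lr * err * x2
            bias -= lr * err
    return w, bias


def score(model: Tuple[List[float], float], x: Tuple[float, float]) -> float:
    """Return a value in (0, 1); higher means an utterance is more likely."""
    w, bias = model
    return sigmoid(w[0] * x[0] + w[1] * x[1] + bias)


# Toy learning data (made up for illustration): shallow inspirations without
# a following utterance, deeper and faster inspirations followed by one.
toy_data = [((0.1, 0.0), 0), ((0.2, 0.1), 0), ((0.8, 0.6), 1), ((0.9, 0.8), 1)]
model = train(toy_data)
```

The output of `score` could then be compared with a threshold in the same way as the utterance prediction score.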
- the information processing system 1 starts the activation process (step S11).
- the information processing system 1 performs preprocessing necessary for voice recognition.
- the information processing system 1 performs processing such as activation of a microphone and communication connection (connection to the cloud) between a client and a server.
- the server device 100 instructs the terminal device 10 to start the microphone and start the voice recognition.
- The terminal device 10 activates the microphone and the voice recognition. In this way, when an utterance is predicted, the information processing system 1 executes the preprocessing necessary for voice recognition in advance.
- the information processing system 1 causes the user U1 to recognize the activation (step S12).
- the information processing system 1 performs a process of clearly indicating the activation of voice recognition or the like by outputting sound or light from the terminal device 10.
- The terminal device 10 performs a WakeUpResponse (hereinafter also simply referred to as "start notification") by emitting a notification sound or light indicating the activation of voice recognition.
- the terminal device 10 causes the user U1 to recognize the activation by turning on the light source unit 18. As a result, the user U1 can recognize that the input by voice has become possible.
- the user U1 speaks (step S13). For example, the user U1 performs voice input requesting predetermined information from the terminal device 10. The user U1 performs a voice input requesting the terminal device 10 to perform a search process.
- the information processing system 1 performs conventional processing (processing of the voice dialogue system) on the input by the user U1 (step S14).
- The information processing system 1 interprets the user's input by natural language understanding (NLU: Natural Language Understanding) and executes the corresponding process (Action). The interpretation and the corresponding process may be performed by either the server device 100 or the terminal device 10.
- The terminal device 10 outputs a voice response to the request of the user U1, beginning "OK. Here's the result".
- On the other hand, when it is predicted that the user U1 does not speak, the information processing system 1 does not start the activation process of step S11.
- In this case, the information processing system 1 does not activate voice recognition.
- That is, the terminal device 10 does not activate voice recognition.
- By deciding whether or not to activate voice recognition and the like using the utterance prediction result based on the user's inspiration, the information processing system 1 makes it possible to omit the activation word according to the breathing state before the utterance.
- the information processing system 1 detects the respiratory state before the user speaks and dynamically changes the voice UI system according to the state. As a result, the information processing system 1 can omit the activation word at the time of breathing before making a request utterance to the system. Therefore, the information processing system 1 can improve usability.
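The flow from the prediction result to the activation steps can be sketched as follows. The class and method names are hypothetical, and the log entries stand in for real device operations; only the ordering (step S11 preprocessing, then the step S12 start notification, and no activation when no utterance is predicted) is taken from the description above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VoiceUiSketch:
    """Hypothetical stand-in for the activation logic of the system."""
    active: bool = False
    log: List[str] = field(default_factory=list)

    def on_prediction(self, will_speak: bool) -> None:
        if not will_speak:
            # No utterance predicted: voice recognition is not activated.
            return
        # Step S11: preprocessing needed for voice recognition.
        self.log.append("S11: activate microphone and voice recognition")
        self.log.append("S11: connect client to server (cloud)")
        self.active = True
        # Step S12: start notification (WakeUpResponse) by sound or light.
        self.log.append("S12: start notification")


ui = VoiceUiSketch()
ui.on_prediction(False)  # stays inactive, nothing is logged
ui.on_prediction(True)   # activation and start notification
```

In this arrangement the prediction result takes over the role that an activation word would otherwise play.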
- The above describes the case where the respiration sensor 171 using a millimeter-wave radar detects the inspiration information indicating the user's inspiration, but the sensor is not limited to a millimeter-wave radar; any sensor may be used as long as it can detect the user's inspiration information. Examples are described below.
- The above describes detection of inspiration information using the respiration sensor 171 with a millimeter-wave radar, that is, a non-contact type sensor, but the sensor used for detecting (acquiring) the inspiration information is not limited to the non-contact type and may be a contact type. An example of the contact type sensor is described below.
- the respiration sensor 171 may be a wearable sensor.
- contact type sensors of various modes such as a band type, a jacket type, and a mask type may be used.
- When a band type sensor is used for the respiration sensor 171, the information processing system 1 acquires the displacement amount of breathing from the expansion and contraction of a band wrapped around the user's chest or abdomen.
- When a jacket type sensor is used for the respiration sensor 171, the band is embedded in a jacket worn by the user.
- the accuracy of respiration detection can be improved by equipping sensors at a plurality of locations (directions).
- When an acceleration sensor is used for the respiration sensor 171, the information processing system 1 may observe the movement of the chest with an acceleration sensor mounted on a wearable device worn on the user's upper body, such as a neck-hanging device, or on a smartphone, and estimate the respiration amount. When a mask type sensor is used for the respiration sensor 171, the information processing system 1 detects the exhalation and inspiration speeds with an air volume sensor or a pressure sensor mounted on the mask, and estimates the depth and period of respiration from the accumulated displacement amount.
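For the mask type sensor, estimating depth and period from the accumulated displacement can be sketched as follows. The sign convention (positive flow for inspiration, negative for exhalation), the fixed sampling interval `dt`, and the function names are assumptions; the source states only that depth and period are estimated from the accumulated displacement amount.

```python
from typing import List, Optional, Sequence


def volume_trace(flow: Sequence[float], dt: float) -> List[float]:
    """Accumulate signed airflow samples into a volume trace (rectangle rule)."""
    volume, trace = 0.0, []
    for q in flow:
        volume += q * dt
        trace.append(volume)
    return trace


def breathing_depth(trace: Sequence[float]) -> float:
    """Depth as the peak-to-trough range of the accumulated volume."""
    return max(trace) - min(trace)


def breathing_period(flow: Sequence[float], dt: float) -> Optional[float]:
    """Mean time between successive inspiration onsets, or None if fewer
    than two onsets are observed."""
    onsets = [i for i in range(1, len(flow)) if flow[i - 1] <= 0 < flow[i]]
    if len(onsets) < 2:
        return None
    return (onsets[-1] - onsets[0]) * dt / (len(onsets) - 1)


# Toy signal: two samples of inspiration followed by two of exhalation.
flow_samples = [1, 1, -1, -1, 1, 1, -1, -1, 1]
trace = volume_trace(flow_samples, dt=0.5)
depth = breathing_depth(trace)
period = breathing_period(flow_samples, dt=0.5)
```

Real sensor data would need filtering before onset detection, since noise around zero flow would otherwise produce spurious onsets.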
- a VR (Virtual Reality) headset that covers the user's mouth may be used for the breathing sensor 171.
- the information processing system 1 recognizes the sound of exhaled breath by the proximity microphone, recognizes the amount of time change in exhalation, and estimates the depth and speed of breathing.
- the information processing system 1 recognizes the sound of noise generated when the exhaled breath hits the microphone by the proximity microphone, recognizes the amount of time change of exhalation, and estimates the depth and speed of breathing.
- The non-contact type sensor is not limited to the millimeter-wave radar, and various non-contact type sensors may be used for the respiration sensor 171.
- An example of a non-contact type sensor other than the millimeter wave radar will be described below.
- For example, image sensing, a method of detecting respiration from the temperature around the nose, a proximity sensor, or a radar other than a millimeter-wave radar may be used.
- When image sensing is used for the respiration sensor 171, the information processing system 1 recognizes, with a thermo camera, the amount of change over time between exhalation and inspiration, which differ in temperature, and estimates the depth, cycle, and speed of respiration. The information processing system 1 may also perform image sensing of exhaled breath that turns white in cold conditions, recognize the amount of change over time of the exhaled breath, and estimate the depth, cycle, and speed of respiration.
- "Capacitive film-like proximity sensor that monitors human movement and breathing" <https://www.aist.go.jp/aist_j/press_release/pr2016/pr20160125/pr20160125.html>
- Heart rate / respiration detection sensor "GZS-350 series" <https://www.ipros.jp/product/detail/2000348329/>
- the information processing system 1 detects the movement of the user's chest by the phase difference of the received signal of the millimeter-wave radar and estimates the respiratory volume.
- The terminal device 10 uses the sensor information detected by the respiration sensor 171 to detect the movement of the user's chest from the phase difference of the received signal of the millimeter-wave radar, and generates the user's inspiration information by estimating the respiration volume. The terminal device 10 then transmits the generated inspiration information to the server device 100.
- the server device 100 may generate the intake information of the user.
- In this case, the terminal device 10 transmits the sensor information detected by the respiration sensor 171 to the server device 100. The server device 100 that has received the sensor information may then use it to detect the movement of the user's chest from the phase difference of the received signal of the millimeter-wave radar and estimate the respiration volume, thereby generating the user's inspiration information.
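The step from a radar phase difference to chest movement can be sketched with the standard round-trip relation, in which a displacement Δd changes the received phase by Δφ = 4πΔd/λ. The source does not give a formula, so applying this relation here is an assumption, as are the function names, the example wavelength, and the premise that the phase samples are already unwrapped.

```python
import math
from typing import List, Sequence


def displacement_from_phase(delta_phase_rad: float, wavelength_m: float) -> float:
    """Radial displacement for a given phase change of the received signal.

    Round-trip path: delta_phase = 4 * pi * displacement / wavelength.
    """
    return wavelength_m * delta_phase_rad / (4.0 * math.pi)


def chest_displacement_trace(phases_rad: Sequence[float],
                             wavelength_m: float) -> List[float]:
    """Cumulative chest displacement relative to the first sample.

    ASSUMPTION: phases_rad is already unwrapped (no 2*pi jumps).
    """
    trace = [0.0]
    for prev, cur in zip(phases_rad, phases_rad[1:]):
        trace.append(trace[-1] + displacement_from_phase(cur - prev, wavelength_m))
    return trace


# Example with a hypothetical 4 mm wavelength: a phase change of pi radians
# corresponds to one quarter of a wavelength of chest movement.
radar_trace = chest_displacement_trace([0.0, math.pi, 0.0], wavelength_m=0.004)
```

Mapping the displacement trace to a respiration volume would need a further per-user calibration, which the text does not describe.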
- the above sensor is merely an example of a sensor used for acquiring intake information, and any sensor may be used as long as it can acquire intake information.
- the information processing system 1 may use any sensor to detect the inspiratory information as long as it can detect the inspiratory information indicating the inspiratory air of the user.
- the sensor unit 17 of the terminal device 10 has at least one of the above-mentioned sensors, and detects intake information by the sensor.
- the information processing system 1 may generate intake information using the sensor information detected by the sensor of the sensor unit 17.
- the terminal device 10 and the server device 100 may generate inspiratory information using sensor information (point group data) detected by the respiration sensor 171 (millimeter wave radar).
- the terminal device 10 and the server device 100 may generate inspiration information from the sensor information (point group data) detected by the respiration sensor 171 (millimeter wave radar) by appropriately using various techniques.
- FIG. 5 is a diagram showing a configuration example of the information processing system according to the embodiment.
- the information processing system 1 shown in FIG. 5 may include a plurality of terminal devices 10 and a plurality of server devices 100.
- the server device 100 is a computer that predicts whether or not the user speaks after the user's inspiration based on the inspiration information indicating the user's inspiration.
- the server device 100 classifies the user's intake air based on the user's intake air information. Further, the server device 100 is a computer that transmits various information to the terminal device 10.
- the server device 100 is a server device used to provide services related to various functions.
- the server device 100 may have software modules such as voice signal processing, voice recognition, utterance semantic analysis, and dialogue control.
- the server device 100 may have a voice recognition function.
- the server device 100 may have functions of natural language understanding (NLU) and automatic speech recognition (ASR: Automatic Speech Recognition).
- the server device 100 may estimate information about a user's intent (intention) or entity (target) from input information uttered by the user.
- the server device 100 functions as a voice recognition server having functions of natural language understanding and automatic voice recognition.
- the terminal device 10 is a terminal device that detects intake information indicating the user's intake with a sensor. For example, the terminal device 10 detects inspiration information indicating the inspiration of the user by the respiration sensor 171.
- the terminal device 10 is an information processing device that transmits user intake information to a server device such as the server device 100. Further, the terminal device 10 may have a voice recognition function such as natural language understanding and automatic voice recognition. For example, the terminal device 10 may estimate information about a user's intent (intention) or entity (target) from input information uttered by the user.
- The terminal device 10 is a device used by the user.
- the terminal device 10 accepts input by the user.
- the terminal device 10 accepts voice input by the user's utterance and input by the user's operation.
- the terminal device 10 displays information according to the input of the user.
- the terminal device 10 may be any device as long as the processing in the embodiment can be realized.
- the terminal device 10 may be any device as long as it has a function of detecting the intake information of the user and transmitting it to the server device 100.
- The terminal device 10 may be a device such as a smart speaker, a television, a smartphone, a tablet terminal, a notebook PC (Personal Computer), a desktop PC, a mobile phone, or a PDA (Personal Digital Assistant).
- the terminal device 10 may be a wearable terminal (Wearable Device) or the like that the user can wear.
- the terminal device 10 may be a wristwatch-type terminal, a glasses-type terminal, or the like.
- FIG. 6 is a diagram showing a configuration example of the server device 100 according to the embodiment of the present disclosure.
- the server device 100 includes a communication unit 110, a storage unit 120, and a control unit 130.
- The server device 100 may have an input unit (for example, a keyboard, a mouse, etc.) that receives various operations from the administrator of the server device 100, and a display unit (for example, a liquid crystal display, etc.) for displaying various information.
- the communication unit 110 is realized by, for example, a NIC (Network Interface Card) or the like. Then, the communication unit 110 is connected to the network N (see FIG. 5) by wire or wirelessly, and transmits / receives information to / from another information processing device such as the terminal device 10. Further, the communication unit 110 may send and receive information to and from a user terminal (not shown) used by the user.
- the storage unit 120 is realized by, for example, a semiconductor memory element such as a RAM (Random Access Memory) or a flash memory (Flash Memory), or a storage device such as a hard disk or an optical disk. As shown in FIG. 6, the storage unit 120 according to the embodiment includes an intake information storage unit 121, a user information storage unit 122, a threshold information storage unit 123, and a functional information storage unit 124.
- The storage unit 120 may store various information other than the above.
- the storage unit 120 stores information of a voice recognition application (program) that realizes a voice recognition function.
- the server device 100 can execute voice recognition by activating a voice recognition application (also simply referred to as "voice recognition").
- the storage unit 120 stores various information used for voice recognition.
- The storage unit 120 stores information on the dictionary used for voice recognition (voice recognition dictionary).
- the storage unit 120 stores information from a plurality of voice recognition dictionaries.
- the storage unit 120 stores information such as a voice recognition dictionary for long sentences (dictionary for long sentences), a voice recognition dictionary for medium-length sentences (dictionary for medium-length sentences), and a voice recognition dictionary for short sentences (dictionary for words / phrases).
- the intake information storage unit 121 stores various information related to the user's intake (inspiration).
- the intake information storage unit 121 stores various information such as intake information of each user in association with the identification information (user ID) of each user.
- the intake information storage unit 121 stores intake information including an increase amount of the user's intake.
- the intake information storage unit 121 stores intake information including the intake amount of the user's intake.
- the intake information storage unit 121 stores intake information including the initial intake amount at the start of the user's intake.
- the intake information storage unit 121 stores intake information including the maximum intake amount of the user's intake.
- the intake information storage unit 121 stores time point information indicating the start time of the utterance after the user's inspiration.
- the intake information storage unit 121 stores utterance information including the length and number of characters of the utterance after the user's inspiration.
- the intake information storage unit 121 is not limited to the above, and may store various information depending on the purpose.
- the intake information storage unit 121 may store not only intake information but also other information related to the user's respiration.
- the intake information storage unit 121 may store information regarding the user's intake.
- the intake information storage unit 121 may store various types of information necessary for generating the graphs GR1 to GR5.
- the intake information storage unit 121 may store various types of information shown in the graphs GR1 to GR5.
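The per-user storage described above for the intake information storage unit 121 can be sketched as follows. This is only an illustrative sketch: the class names `IntakeRecord` and `IntakeStore`, the field names, and the units are assumptions made for this example and do not appear in the disclosure.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class IntakeRecord:
    """One inspiration of a user (field names are hypothetical)."""
    increase_amount: float          # increase amount of the intake
    intake_amount: float            # intake amount
    initial_amount: float           # initial intake amount at the start of the intake
    max_amount: float               # maximum intake amount
    utterance_start_time: Optional[float] = None  # None if no utterance followed
    utterance_length: Optional[float] = None      # length of the following utterance
    utterance_chars: Optional[int] = None         # number of characters of the utterance


class IntakeStore:
    """Keeps intake records in association with a user ID."""

    def __init__(self) -> None:
        self._records = defaultdict(list)

    def add(self, user_id: str, record: IntakeRecord) -> None:
        self._records[user_id].append(record)

    def history(self, user_id: str) -> list:
        # Return a copy so callers cannot mutate the stored list.
        return list(self._records.get(user_id, []))


store = IntakeStore()
store.add("U1", IntakeRecord(0.2, 0.6, 0.4, 0.9))
```

Keying the records by user ID mirrors the association with the identification information (user ID) of each user; a real implementation might use a database instead of an in-memory dictionary.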
- the user information storage unit 122 stores various information about the user.
- the user information storage unit 122 stores various information such as attribute information of each user.
- the user information storage unit 122 stores information about the user such as the user ID, age, gender, and place of residence.
- the user information storage unit 122 stores information about the user U1 such as the age, gender, and place of residence of the user U1 in association with the user ID "U1" that identifies the user U1.
- the user information storage unit 122 stores information for identifying a device (television, smartphone, etc.) used by each user in association with the user.
- the user information storage unit 122 stores information (terminal ID, etc.) that identifies the terminal device 10 used by each user in association with the user.
- the user information storage unit 122 is not limited to the above, and may store various information depending on the purpose.
- the user information storage unit 122 may store other demographic attribute information and psychographic attribute information regardless of age and gender.
- the user information storage unit 122 may store information such as a name, a home, a place of work, an interest, a family structure, an income, and a lifestyle.
- the threshold information storage unit 123 stores various information related to the threshold value.
- the threshold information storage unit 123 stores various information related to the threshold used for the prediction process and the classification process.
- FIG. 7 is a diagram showing an example of the threshold information storage unit according to the embodiment.
- the threshold information storage unit 123 shown in FIG. 7 includes items such as "threshold ID", "use", "threshold name", and "value".
- the "threshold ID" indicates identification information for identifying the threshold.
- the "use" indicates the use of the threshold.
- the "threshold name" indicates the name (character string) of the threshold (variable) used as the threshold identified by the corresponding threshold ID.
- the "value" indicates a specific value of the threshold identified by the corresponding threshold ID.
- in the example of FIG. 7, the use of the threshold identified by the threshold ID "TH1" (threshold TH1) is to predict the presence or absence of an utterance.
- the threshold TH1 is used under the threshold name "Threshold_uttr_pr".
- the value of the threshold TH1 is "VL1".
- in FIG. 7, the value is indicated by an abstract code such as "VL1", but the value is assumed to be a specific numerical value such as "0.5" or "1.8".
- the threshold information storage unit 123 stores various threshold values corresponding to Threshold_uttr, Threshold_ask, and the like shown in FIG. Further, the threshold information storage unit 123 is not limited to the above, and may store various information depending on the purpose.
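The items of the threshold information storage unit 123 ("threshold ID", "use", "threshold name", "value") could be held in a small lookup structure such as the one below. This is a minimal sketch: the class names are hypothetical, and the numeric value 0.5 is only a placeholder standing in for the abstract code "VL1".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThresholdRecord:
    threshold_id: str  # e.g. "TH1"
    use: str           # what the threshold is used for
    name: str          # variable name, e.g. "Threshold_uttr_pr"
    value: float       # concrete numeric value (placeholder in this sketch)


class ThresholdStore:
    """Looks up thresholds either by threshold ID or by threshold name."""

    def __init__(self, records):
        self._by_id = {r.threshold_id: r for r in records}
        self._by_name = {r.name: r for r in records}

    def value_by_id(self, threshold_id: str) -> float:
        return self._by_id[threshold_id].value

    def value_by_name(self, name: str) -> float:
        return self._by_name[name].value


thresholds = ThresholdStore([
    ThresholdRecord(
        threshold_id="TH1",
        use="prediction of the presence or absence of utterance",
        name="Threshold_uttr_pr",
        value=0.5,  # placeholder for "VL1"
    ),
])
```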
- the functional information storage unit 124 stores various information related to the function.
- the function information storage unit 124 stores information about each function executed in response to user input.
- the function information storage unit 124 stores information regarding inputs required for executing the function.
- the function information storage unit 124 stores input items necessary for executing each function.
- the functional information storage unit 124 is not limited to the above, and may store various information depending on the purpose.
- the control unit 130 is realized by, for example, a CPU (Central Processing Unit), an MPU (Micro Processing Unit), or the like executing a program stored inside the server device 100 (for example, an information processing program according to the present disclosure) using a RAM (Random Access Memory) or the like as a work area. Further, the control unit 130 may be realized by, for example, an integrated circuit such as an ASIC (Application Specific Integrated Circuit) or an FPGA (Field Programmable Gate Array).
- the control unit 130 includes an acquisition unit 131, a prediction unit 132, a selection unit 133, an execution unit 134, and a transmission unit 135, and realizes or executes the functions and operations of information processing described below.
- the internal configuration of the control unit 130 is not limited to the configuration shown in FIG. 6, and may be another configuration as long as it is a configuration for performing information processing described later.
- the connection relationship of each processing unit included in the control unit 130 is not limited to the connection relationship shown in FIG. 6, and may be another connection relationship.
- the acquisition unit 131 acquires various information.
- the acquisition unit 131 acquires various information from an external information processing device.
- the acquisition unit 131 acquires various information from the terminal device 10.
- the acquisition unit 131 acquires, from the terminal device 10, various information detected by the sensor unit 17 of the terminal device 10.
- the acquisition unit 131 acquires, from the terminal device 10, various information detected by the respiration sensor 171 of the sensor unit 17.
- the acquisition unit 131 acquires various information from the storage unit 120.
- the acquisition unit 131 acquires various information from the intake information storage unit 121, the user information storage unit 122, the threshold information storage unit 123, and the functional information storage unit 124.
- the acquisition unit 131 acquires various information predicted by the prediction unit 132.
- the acquisition unit 131 acquires various information selected by the selection unit 133.
- the acquisition unit 131 acquires intake information indicating the user's intake.
- the acquisition unit 131 acquires intake information including an increase amount of the user's intake.
- the acquisition unit 131 acquires intake information including the intake amount of the user's intake.
- the acquisition unit 131 acquires intake information including the initial intake amount at the start of the user's intake.
- the acquisition unit 131 acquires intake information including the maximum intake amount of the user's intake.
- the acquisition unit 131 acquires time point information indicating the start time of the utterance after the user's inhalation.
- the acquisition unit 131 acquires utterance information including the length of the utterance after the user's inhalation and the number of characters. For example, the acquisition unit 131 acquires the intake information BINF1 indicating the intake of the user U1.
- the prediction unit 132 predicts various types of information.
- the prediction unit 132 classifies various types of information.
- the prediction unit 132 calculates various types of information.
- the prediction unit 132 determines various information.
- the prediction unit 132 makes various determinations.
- the prediction unit 132 determines various information. For example, the prediction unit 132 predicts various types of information based on information from an external information processing device and information stored in the storage unit 120.
- the prediction unit 132 predicts various types of information based on information from other information processing devices such as the terminal device 10.
- the prediction unit 132 predicts various types of information based on the information stored in the intake information storage unit 121, the user information storage unit 122, the threshold information storage unit 123, and the functional information storage unit 124.
- the prediction unit 132 classifies various types of information. For example, the prediction unit 132 classifies various types of information based on information from an external information processing device and information stored in the storage unit 120.
- the prediction unit 132 predicts various information based on various information acquired by the acquisition unit 131.
- the prediction unit 132 predicts various information based on various information selected by the selection unit 133.
- the prediction unit 132 makes various judgments based on the prediction. Various judgments are made based on the information acquired by the acquisition unit 131.
- the prediction unit 132 calculates the score based on the intake information.
- the prediction unit 132 calculates a score used for predicting the user's utterance based on the intake information.
- the prediction unit 132 predicts whether or not the user speaks after the user's inspiration based on the inspiration information acquired by the acquisition unit 131.
- the prediction unit 132 predicts whether or not the user speaks after the inspiration based on the amount of increase.
- the prediction unit 132 predicts whether or not the user speaks after the inspiration based on the inspiratory amount. The prediction unit 132 also predicts whether or not the user speaks after the inspiration based on the initial inspiratory amount.
- the prediction unit 132 calculates the score using the intake information and a predetermined formula.
- the prediction unit 132 predicts whether or not the user speaks after the inspiration by using the score calculated based on the inspiration information.
- the prediction unit 132 compares the score with the threshold value, and predicts whether or not the user speaks after inspiration based on the comparison result.
- the prediction unit 132 predicts that the user speaks after inspiration when the comparison result between the score and the threshold value satisfies a predetermined condition. If the score is greater than the threshold, the prediction unit 132 predicts that the user will speak after inspiration.
- the prediction unit 132 calculates the utterance prediction score "Score_uttr_pr", which is the score used for utterance prediction, using the current inspiratory amount "B_current", the increase amount "B_increase", and the equation (1). For example, the prediction unit 132 predicts that the user U1 will speak when the utterance prediction score "Score_uttr_pr" is larger than the utterance presence / absence prediction threshold "Threshold_uttr_pr".
- the prediction unit 132 predicts that the user U1 does not speak when the utterance prediction score "Score_uttr_pr" is equal to or less than the utterance presence / absence prediction threshold "Threshold_uttr_pr".
- the prediction unit 132 classifies the user's inspiration based on the intake information.
- the prediction unit 132 classifies the user's inspiration based on the maximum intake amount.
- the prediction unit 132 classifies the user's inspiration based on the interval between the time of the maximum intake amount and the time of the start of utterance.
- the prediction unit 132 classifies the user's inspiration based on the length of the utterance and the number of characters.
- the prediction unit 132 classifies the user's inspiration into at least one of a plurality of types including a request-type inspiration and a non-request-type inspiration.
- the prediction unit 132 classifies the user's inspiration into at least one of a plurality of types including a long-sentence-type inspiration and a short-sentence-type inspiration.
- the prediction unit 132 classifies the user's inspiration into at least one of a plurality of types including a normal-processing-desired-type inspiration and a shortened-processing-desired-type inspiration.
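The utterance prediction performed by the prediction unit 132 (score calculation followed by a threshold comparison) can be sketched as below. Equation (1) is not reproduced in this passage, so the weighted sum used here, including the coefficients `a` and `b`, is purely an assumption for illustration; only the comparison rule (speak if the score is larger than the threshold, otherwise not) follows the text.

```python
def utterance_prediction_score(b_current: float, b_increase: float,
                               a: float = 1.0, b: float = 1.0) -> float:
    """Hypothetical stand-in for equation (1): a weighted sum of the
    current inspiratory amount and the increase amount."""
    return a * b_current + b * b_increase


def predicts_utterance(score_uttr_pr: float, threshold_uttr_pr: float) -> bool:
    """True when the score is larger than the threshold; a score equal to
    or less than the threshold means no utterance is predicted."""
    return score_uttr_pr > threshold_uttr_pr


score = utterance_prediction_score(b_current=0.5, b_increase=0.25)
will_speak = predicts_utterance(score, threshold_uttr_pr=0.5)
```

The strict comparison matters at the boundary: a score exactly equal to the threshold is treated as "does not speak", matching the "equal to or less than" wording.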
- the selection unit 133 selects various information.
- the selection unit 133 extracts various information.
- the selection unit 133 specifies various types of information.
- the selection unit 133 selects various information based on the information from the external information processing device and the information stored in the storage unit 120.
- the selection unit 133 selects various types of information based on information from other information processing devices such as the terminal device 10.
- the selection unit 133 selects various information based on the information stored in the intake information storage unit 121, the user information storage unit 122, the threshold information storage unit 123, and the function information storage unit 124.
- the selection unit 133 selects various information based on the various information acquired by the acquisition unit 131.
- the selection unit 133 selects various information based on the various information predicted by the prediction unit 132.
- the selection unit 133 selects various information based on the processing executed by the execution unit 134.
- the selection unit 133 performs selection processing according to the classification result by the prediction unit 132.
- the selection unit 133 selects the process to be executed according to the classification result by the prediction unit 132.
- the selection unit 133 selects information to be used for processing the user's utterance according to the classification result by the prediction unit 132.
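One way the selection unit 133 might select information according to the classification result is to map each inspiration type to a voice recognition dictionary. The mapping below is hypothetical: the disclosure states that selection follows the classification result, but the specific pairing of types and dictionaries, and all identifiers, are assumptions made for this sketch.

```python
from enum import Enum
from typing import Optional


class InspirationType(Enum):
    LONG_SENTENCE = "long_sentence"
    SHORT_SENTENCE = "short_sentence"


# Hypothetical pairing of classification results with dictionaries.
DICTIONARY_FOR_TYPE = {
    InspirationType.LONG_SENTENCE: "long_sentence_dictionary",
    InspirationType.SHORT_SENTENCE: "word_phrase_dictionary",
}


def select_dictionary(inspiration_type: Optional[InspirationType]) -> Optional[str]:
    """Return the dictionary name for a classification result,
    or None when no specific dictionary is associated with it."""
    return DICTIONARY_FOR_TYPE.get(inspiration_type)
```

A table-driven mapping keeps the selection logic separate from the classification logic, so additional inspiration types can be supported by adding entries.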
- the execution unit 134 executes various processes.
- the execution unit 134 determines the execution of various processes.
- the execution unit 134 executes various processes based on information from an external information processing device.
- the execution unit 134 executes various processes based on the information stored in the storage unit 120.
- the execution unit 134 executes various processes based on the information stored in the intake information storage unit 121, the user information storage unit 122, the threshold information storage unit 123, and the function information storage unit 124.
- the execution unit 134 executes various processes based on various information acquired by the acquisition unit 131.
- the execution unit 134 executes various processes based on various information predicted by the prediction unit 132.
- the execution unit 134 executes various processes based on various information selected by the selection unit 133.
- the execution unit 134 generates various information.
- the execution unit 134 generates various information based on the information from the external information processing device and the information stored in the storage unit 120.
- the execution unit 134 generates various information based on information from other information processing devices such as the terminal device 10.
- the execution unit 134 generates various information based on the information stored in the intake information storage unit 121, the user information storage unit 122, the threshold information storage unit 123, and the function information storage unit 124.
- the execution unit 134 executes processing according to the prediction result by the prediction unit 132.
- when the prediction unit 132 predicts that the user will speak after inhalation, the execution unit 134 executes preprocessing related to voice recognition.
- the execution unit 134 executes the pre-processing before the user's intake is completed.
- the execution unit 134 executes voice recognition when the prediction unit 132 predicts that the user will speak after inhalation.
- the execution unit 134 executes preprocessing that instructs the activation of voice recognition before the end of the user's inspiration. For example, when the user U1 is predicted to speak, the execution unit 134 executes the preprocessing necessary for voice recognition, assuming that there is a high possibility that the user U1 will speak after the current inspiration.
- the execution unit 134 executes a pre-processing instructing the terminal device 10 to start voice recognition.
- when the prediction unit 132 predicts that the user speaks after the inspiration, the execution unit 134 executes preprocessing for instructing the terminal device 10 to start voice recognition before the user's inspiration ends.
- the transmission unit 135 transmits various information.
- the transmission unit 135 transmits various information to an external information processing device.
- the transmission unit 135 provides various information to an external information processing device.
- the transmission unit 135 transmits various information to another information processing device such as the terminal device 10.
- the transmission unit 135 provides the information stored in the storage unit 120.
- the transmission unit 135 transmits the information stored in the storage unit 120.
- the transmission unit 135 provides various types of information based on information from other information processing devices such as the terminal device 10.
- the transmission unit 135 provides various types of information based on the information stored in the storage unit 120.
- the transmission unit 135 provides various information based on the information stored in the intake information storage unit 121, the user information storage unit 122, the threshold information storage unit 123, and the functional information storage unit 124.
- the transmission unit 135 transmits information indicating a function to be executed by the terminal device 10 to the terminal device 10.
- the transmission unit 135 transmits information indicating the function determined to be executed by the execution unit 134 to the terminal device 10.
- the transmission unit 135 transmits various types of information to the terminal device 10 in response to an instruction from the execution unit 134.
- the transmission unit 135 transmits information instructing the terminal device 10 to start the voice recognition application.
- when the prediction unit 132 predicts that the user will speak after inhalation, the transmission unit 135 transmits information instructing the terminal device 10 to start voice recognition.
- the transmission unit 135 transmits information instructing the terminal device 10 to start voice recognition before the user's inspiration ends.
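The interplay described above, in which a positive prediction causes the server device 100 to instruct the terminal device 10 to start voice recognition, might look like the following sketch. The message format, the field names, and the `send` callback are assumptions for illustration; the disclosure does not specify a wire format.

```python
import json
from typing import Callable


def build_start_recognition_message(terminal_id: str) -> str:
    """Build a hypothetical JSON instruction telling a terminal to start
    voice recognition."""
    return json.dumps({
        "terminal_id": terminal_id,
        "instruction": "start_voice_recognition",
    })


def on_prediction(will_speak: bool, terminal_id: str,
                  send: Callable[[str], None]) -> bool:
    """Send the start instruction only when an utterance is predicted.

    Returns True if an instruction was sent. The caller is expected to
    invoke this while the inspiration is still in progress, so that the
    instruction reaches the terminal before the inspiration ends.
    """
    if not will_speak:
        return False
    send(build_start_recognition_message(terminal_id))
    return True


sent_messages = []
sent = on_prediction(True, "T1", sent_messages.append)
```

Passing `send` as a callback keeps the sketch independent of any particular transport; in practice it would wrap whatever channel the communication unit 110 uses.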
- FIG. 8 is a diagram showing a configuration example of the terminal device according to the embodiment of the present disclosure.
- the terminal device 10 includes a communication unit 11, an input unit 12, an output unit 13, a storage unit 14, a control unit 15, a display unit 16, a sensor unit 17, and a light source unit 18.
- the communication unit 11 is realized by, for example, a NIC or a communication circuit.
- the communication unit 11 is connected to the network N (Internet or the like) by wire or wirelessly, and transmits / receives information to / from other devices such as the server device 100 via the network N.
- the input unit 12 accepts various inputs.
- the input unit 12 receives the detection by the sensor unit 17 as an input.
- the input unit 12 receives an input of intake information indicating the user's intake.
- the input unit 12 receives the input of the intake air information detected by the sensor unit 17.
- the input unit 12 receives the input of the inspiratory information detected by the respiration sensor 171.
- the input unit 12 receives input of inspiratory information based on the point cloud data detected by the respiration sensor 171.
- the input unit 12 accepts the input of the user's utterance information.
- the input unit 12 receives the input of the inspiratory information of the user who inputs by the body movement.
- the input unit 12 accepts the user's gesture and line of sight as input.
- the input unit 12 receives sound as input via the sensor unit 17, which has a function of detecting voice.
- the input unit 12 receives the voice information detected by the microphone (sound sensor) that detects the voice as the input information.
- the input unit 12 receives the voice spoken by the user as input information.
- the input unit 12 may accept an operation (user operation) on the terminal device 10 used by the user as an operation input by the user.
- the input unit 12 may receive, via the communication unit 11, information regarding the user's operation using a remote controller (remote control).
- the input unit 12 may have a button provided on the terminal device 10 or a keyboard or mouse connected to the terminal device 10.
- the input unit 12 may have a touch panel capable of realizing functions equivalent to those of a remote controller, a keyboard, and a mouse.
- various information is input to the input unit 12 via the display unit 16.
- the input unit 12 receives various operations from the user via the display screen by the function of the touch panel realized by various sensors. That is, the input unit 12 receives various operations from the user via the display unit 16 of the terminal device 10.
- the input unit 12 receives an operation such as a user's designated operation via the display unit 16 of the terminal device 10.
- the input unit 12 functions as a reception unit that receives a user's operation by the function of the touch panel.
- the input unit 12 and the reception unit 153 may be integrated.
- the capacitance method is mainly adopted in tablet terminals, but other detection methods such as the resistive film method, the surface acoustic wave method, the infrared method, and the electromagnetic induction method may be used; any method may be adopted as long as the user's operation can be detected and the touch panel function can be realized.
- the input unit 12 accepts the utterance of the user U1 as an input.
- the input unit 12 receives the utterance of the user U1 detected by the sensor unit 17 as an input.
- the input unit 12 receives the utterance of the user U1 detected by the sound sensor of the sensor unit 17 as an input.
- the output unit 13 outputs various information.
- the output unit 13 has a function of outputting audio.
- the output unit 13 has a speaker that outputs sound.
- the output unit 13 outputs various information by voice according to the control by the execution unit 152.
- the output unit 13 outputs information by voice to the user.
- the output unit 13 outputs the information displayed on the display unit 16 by voice.
- the storage unit 14 is realized by, for example, a semiconductor memory element such as a RAM or a flash memory, or a storage device such as a hard disk or an optical disk.
- the storage unit 14 stores information of a voice recognition application (program) that realizes the voice recognition function.
- the terminal device 10 can execute voice recognition by activating a voice recognition application.
- the storage unit 14 stores various information used for displaying the information.
- the storage unit 14 stores various information used for voice recognition.
- the storage unit 14 stores information of the dictionary (speech recognition dictionary) used for voice recognition.
- the control unit 15 is realized by, for example, a CPU, an MPU, or the like executing a program stored inside the terminal device 10 (for example, an information processing program according to the present disclosure) using a RAM or the like as a work area. Further, the control unit 15 may be realized by an integrated circuit such as an ASIC or FPGA.
- the control unit 15 includes a receiving unit 151, an execution unit 152, a reception unit 153, and a transmission unit 154, and realizes or executes the information processing functions and operations described below.
- the internal configuration of the control unit 15 is not limited to the configuration shown in FIG. 8, and may be another configuration as long as it is a configuration for performing information processing described later.
- the receiving unit 151 receives various information.
- the receiving unit 151 receives various information from an external information processing device.
- the receiving unit 151 receives various information from other information processing devices such as the server device 100.
- the receiving unit 151 receives information instructing the activation of voice recognition from the server device 100.
- the receiving unit 151 receives information instructing the start of the voice recognition application from the server device 100.
- the receiving unit 151 receives execution instructions of various functions from the server device 100. For example, the receiving unit 151 receives information specifying a function from the server device 100 as a function execution instruction. The receiving unit 151 receives the content. The receiving unit 151 receives the content to be displayed from the server device 100.
- the execution unit 152 executes various processes.
- the execution unit 152 determines the execution of various processes.
- the execution unit 152 executes various processes based on information from an external information processing device.
- the execution unit 152 executes various processes based on the information from the server device 100.
- the execution unit 152 executes various processes in response to an instruction from the server device 100.
- the execution unit 152 executes various processes based on the information stored in the storage unit 14.
- the execution unit 152 activates voice recognition.
- the execution unit 152 controls various outputs.
- the execution unit 152 controls the audio output by the output unit 13.
- the execution unit 152 controls the lighting of the light source unit 18.
- the execution unit 152 controls various displays.
- the execution unit 152 controls the display of the display unit 16.
- the execution unit 152 controls the display of the display unit 16 in response to the reception by the receiving unit 151.
- the execution unit 152 controls the display of the display unit 16 based on the information received by the receiving unit 151.
- the execution unit 152 controls the display of the display unit 16 based on the information received by the reception unit 153.
- the execution unit 152 controls the display of the display unit 16 in response to the reception by the reception unit 153.
- the reception unit 153 receives various information.
- the reception unit 153 receives input by the user via the input unit 12.
- the reception unit 153 accepts the utterance by the user as an input.
- the reception unit 153 accepts operations by the user.
- the reception unit 153 accepts the user's operation on the information displayed by the display unit 16.
- the reception unit 153 accepts character input by the user.
- the transmission unit 154 transmits various information to an external information processing device.
- the transmission unit 154 transmits various information to another information processing device such as the server device 100.
- the transmission unit 154 transmits the information stored in the storage unit 14.
- the transmission unit 154 transmits various types of information based on information from other information processing devices such as the server device 100.
- the transmission unit 154 transmits various types of information based on the information stored in the storage unit 14.
- the transmission unit 154 transmits the sensor information detected by the sensor unit 17 to the server device 100.
- the transmission unit 154 transmits the intake information of the user U1 detected by the respiration sensor 171 of the sensor unit 17 to the server device 100.
- the transmission unit 154 transmits the input information input by the user to the server device 100.
- the transmission unit 154 transmits the input information voice-input by the user to the server device 100.
- the transmission unit 154 transmits the input information input by the user's operation to the server device 100.
- the transmission unit 154 transmits the intake information indicating the user's intake to the server device 100.
- the transmission unit 154 transmits the intake information including the increase amount of the user's intake to the server device 100.
- the transmission unit 154 transmits the intake information including the intake amount of the user's intake air to the server device 100.
- the transmission unit 154 transmits the intake information including the initial intake amount at the start of the intake of the user to the server device 100.
- the transmission unit 154 transmits the intake information including the maximum intake amount of the user's intake to the server device 100.
- the transmission unit 154 transmits the time point information indicating the utterance start time point after the user's inhalation to the server device 100.
- the transmission unit 154 transmits utterance information including the length of utterance and the number of characters after the user's inhalation to the server device 100.
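The intake information that the transmission unit 154 sends to the server device 100 could be packaged as a single message, as in the sketch below. The JSON layout and every field name are hypothetical; the disclosure lists which quantities are transmitted but not how they are encoded.

```python
import json


def build_intake_payload(user_id: str, terminal_id: str,
                         increase_amount: float, intake_amount: float,
                         initial_amount: float, max_amount: float) -> str:
    """Serialize one inspiration's intake information for transmission.

    All keys are illustrative assumptions, not a format defined by the
    disclosure.
    """
    return json.dumps({
        "user_id": user_id,
        "terminal_id": terminal_id,
        "intake": {
            "increase_amount": increase_amount,
            "intake_amount": intake_amount,
            "initial_amount": initial_amount,
            "max_amount": max_amount,
        },
    }, sort_keys=True)


payload = build_intake_payload("U1", "T1", 0.2, 0.6, 0.4, 0.9)
decoded = json.loads(payload)
```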
- the display unit 16 is provided on the terminal device 10 and displays various information.
- the display unit 16 is realized by, for example, a liquid crystal display, an organic EL (Electro-Luminescence) display, or the like.
- the display unit 16 may be realized by any means as long as the information provided by the server device 100 can be displayed.
- the display unit 16 displays various information according to the control by the execution unit 152.
- the display unit 16 displays the content.
- the display unit 16 displays the content received by the receiving unit 151.
- the sensor unit 17 detects predetermined information.
- the sensor unit 17 detects the user's intake information.
- the sensor unit 17 has a respiration sensor 171 as a means for detecting inspiration information indicating the inspiration of the user.
- the sensor unit 17 detects the inspiratory information by the respiration sensor 171.
- the sensor unit 17 detects the inspiratory information by the respiration sensor 171 using the millimeter wave radar. Further, the sensor unit 17 is not limited to the millimeter wave radar, and may have a respiration sensor 171 having any configuration as long as it can detect the inspiratory information of the user.
- the respiration sensor 171 may be an image sensor.
- the respiration sensor 171 may be a wearable sensor. As the respiration sensor 171, either a contact type sensor or a non-contact type sensor may be used.
- the sensor unit 17 is not limited to the above, and may have various sensors.
- the sensor unit 17 may have a sensor (position sensor) that detects position information such as a GPS (Global Positioning System) sensor.
- the light source unit 18 has a light source such as an LED (Light Emitting Diode).
- the light source unit 18 emits light.
- the light source unit 18 realizes a desired lighting mode.
- the light source unit 18 realizes a desired lighting mode according to the control by the execution unit 152.
- the light source unit 18 is turned on according to the control by the execution unit 152.
- the light source unit 18 is turned off according to the control by the execution unit 152.
- the light source unit 18 blinks according to the control by the execution unit 152.
- FIG. 9 is a flowchart showing a processing procedure of the information processing apparatus according to the embodiment of the present disclosure. Specifically, FIG. 9 is a flowchart showing a procedure of information processing by the server device 100.
- the server device 100 acquires the inspiration information indicating the user's inspiration (step S101). Then, based on the inspiration information, the server device 100 predicts whether or not the user will speak after the inspiration (step S102).
- FIG. 10 is a sequence diagram showing a processing procedure of the information processing system according to the embodiment of the present disclosure.
- the terminal device 10 detects the inspiratory information indicating the inspiratory air of the user (step S201). For example, the terminal device 10 acquires the user's inspiration information detected by the respiration sensor 171. Then, the terminal device 10 transmits the intake information indicating the user's intake to the server device 100 (step S202).
- based on the inspiration information acquired from the terminal device 10, the server device 100 predicts whether or not the user will speak after the inspiration (step S203). In the example of FIG. 10, the server device 100 predicts that the user will speak after the inspiration.
- the server device 100 issues a voice recognition activation instruction to the terminal device 10 (step S204).
- the server device 100 instructs the terminal device 10 to start voice recognition by transmitting information instructing the start of voice recognition to the terminal device 10.
- the terminal device 10 executes the voice recognition activation process in response to the instruction from the server device 100 (step S205).
- the terminal device 10 outputs a voice corresponding to the activation of voice recognition (step S206).
- the terminal device 10 outputs voice corresponding to the activation of voice recognition and emits light.
- the terminal device 10 outputs a WakeUpResponse (start notification) corresponding to the activation of voice recognition by the output unit 13.
- the terminal device 10 turns on the light source unit 18 to indicate the activation of voice recognition.
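The sequence of FIG. 10 (steps S201 to S206) can be sketched as a small simulation. The function names, the dictionary key `inspiratory_volume`, and the fixed-threshold predictor are hypothetical placeholders introduced only for illustration; this passage does not specify how the prediction in step S203 is computed.

```python
from typing import Callable, List, Tuple


def predict_by_threshold(inspiration_info: dict, threshold: float = 1.0) -> bool:
    """Placeholder for step S203; the threshold rule is an assumption."""
    return inspiration_info["inspiratory_volume"] > threshold


def run_sequence(
    sensor_value: float,
    predict: Callable[[dict], bool] = predict_by_threshold,
) -> Tuple[bool, List[str]]:
    """Walk through steps S201 to S206 and return (recognition_active, trace)."""
    trace = []
    # S201: the terminal device detects the inspiration information.
    inspiration_info = {"inspiratory_volume": sensor_value}
    trace.append("S201 detect")
    # S202: the terminal device transmits it to the server device.
    trace.append("S202 transmit")
    # S203: the server device predicts whether an utterance will follow.
    will_speak = predict(inspiration_info)
    trace.append("S203 predict")
    if not will_speak:
        return False, trace
    # S204: the server device instructs the terminal to start voice recognition.
    trace.append("S204 instruct")
    # S205: the terminal device activates voice recognition.
    trace.append("S205 activate")
    # S206: the terminal device notifies the user (voice output, light source).
    trace.append("S206 notify")
    return True, trace
```

For example, `run_sequence(1.5)` follows all six steps under the assumed threshold of 1.0, while `run_sequence(0.2)` stops after the prediction step.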
- the server device 100 is not limited to the prediction process described above, and may perform various processes.
- the server device 100 may perform a classification process for classifying the user's intake air. This point will be described below. In the following description, the same points as in FIG. 1 will be omitted as appropriate.
- FIG. 11 is a diagram showing an example of processing using the intake classification result.
- FIG. 11 shows an example of omitting the activation word depending on the respiratory state.
- the server device 100 acquires the intake information indicating the intake before the utterance of the user U1. For example, the server device 100 acquires intake information indicating the intake of the user U1 from the terminal device 10 used by the user U1.
- the server device 100 performs the classification process using the intake information indicating the intake of the user U1 (step S301).
- the server device 100 calculates the score using the intake information.
- the server device 100 classifies the intake air of the user U1 by comparing the calculated score with the threshold value.
- the server device 100 classifies the intake air of the user U1 based on the magnitude relationship between the calculated score and each threshold value.
- FIG. 12 is a diagram showing an example of user's intake information.
- FIG. 13 is a diagram showing an example of prediction using the user's intake air.
- the graph GR2 in FIG. 12 is a graph showing the relationship between time and the amount of intake air, and the horizontal axis shows time and the vertical axis shows the amount of intake air. The same points as the graph GR1 in FIG. 2 will not be described with respect to the graph GR2.
- the maximum respiratory volume "B_max” in the graph GR2 indicates the maximum inspiratory volume (maximum inspiratory volume) reached by inspiration before utterance.
- the maximum inspiratory-speech time “T_bmax_uttr” indicates the interval (pre-speech time) from the time when the maximum inspiratory amount is reached to the time when the utterance is started (speech start time).
- the maximum inspiratory-speech time “T_bmax_uttr” indicates the difference between the time t2 indicating the time when the utterance was started (speech start time) and the time t1 indicating the time when the maximum inspiratory amount was reached.
- the increase amount "B_increase” in the graph GR2 indicates a change (increase amount) in the inspiratory volume before reaching the maximum respiratory volume "B_max".
- the increase amount "B_increase” may be a change (increase amount) in the intake amount at the time of acquisition (current time) of the intake information.
- the inspiratory information includes the increased amount "B_increase” in FIG. 12, the maximum respiration amount “B_max”, and the maximum inspiratory-speech time "T_bmax_uttr".
- the inspiratory information may not include the maximum inspiratory-utterance time "T_bmax_uttr”.
- the score may be calculated by setting "c * (1 / T_bmax_uttr)", which is the term (third term) related to the maximum inspiratory-utterance time "T_bmax_uttr", to "0".
- the maximum respiratory volume "B_max” may not be included.
- the server device 100 may predict the maximum respiration volume “B_max” as described with reference to FIG. 2 and calculate the score using the predicted maximum respiration volume “B_max”.
- the server device 100 uses the increase amount "B_increase", the maximum respiration volume "B_max", the maximum inspiratory-speech time "T_bmax_uttr", and equation (2) to calculate the utterance score "Score_uttr", which is a score used for utterance prediction.
- the server device 100 classifies the inspiration of the user U1 using two thresholds, namely the utterance presence / absence threshold value "Threshold_uttr" and the request-type utterance threshold value "Threshold_ask".
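Equation (2) itself is not reproduced in this text; only its third term, "c * (1 / T_bmax_uttr)", is quoted. The sketch below therefore assumes a simple weighted sum in which the coefficient "a" multiplies "B_max" and "b" multiplies "B_increase". That form, the function name, and the default coefficient values are assumptions for illustration, not the formula of the disclosure.

```python
from typing import Optional


def utterance_score(
    b_max: float,
    b_increase: float,
    t_bmax_uttr: Optional[float] = None,
    a: float = 1.0,
    b: float = 1.0,
    c: float = 1.0,
) -> float:
    """Assumed form: a * B_max + b * B_increase + c * (1 / T_bmax_uttr)."""
    if t_bmax_uttr is None:
        # When the maximum inspiratory-speech time is unavailable,
        # the third term is set to 0, as described in the text.
        third_term = 0.0
    else:
        if t_bmax_uttr <= 0:
            raise ValueError("t_bmax_uttr must be positive")
        third_term = c * (1.0 / t_bmax_uttr)
    return a * b_max + b * b_increase + third_term
```

With the default coefficients, a shorter interval between the maximum inspiration and the utterance start raises the score, which matches the role of the reciprocal term.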
- the server device 100 classifies the inspiration of the user U1 by comparing the utterance score “Score_uttr” with each threshold value. That is, the server device 100 classifies the utterance type according to the value of the utterance score "Score_uttr".
- the server device 100 classifies the inspiration of the user U1 by comparing the utterance score "Score_uttr" with the utterance presence / absence threshold value "Threshold_uttr" and the request-type utterance threshold value "Threshold_ask". In the example of FIG. 13, the utterance presence / absence threshold value "Threshold_uttr" is smaller than the request-type utterance threshold value "Threshold_ask".
- when the utterance score "Score_uttr" is larger than the request-type utterance threshold value "Threshold_ask", the server device 100 classifies the inspiration of the user U1 as inspiration with a high possibility of a request-type utterance (also referred to as "request-type inspiration"). In this case, the information processing system 1 activates the voice UI (voice recognition) with an explicit activation notification and performs the normal flow (processing).
- when the utterance score "Score_uttr" is equal to or less than the request-type utterance threshold value "Threshold_ask" and larger than the utterance presence / absence threshold value "Threshold_uttr", the server device 100 classifies the inspiration of the user U1 as inspiration that may be followed by a request-type utterance, but without high certainty (also referred to as "intermediate value").
- in this case, the information processing system 1 activates voice recognition without an explicit activation notification and activates, for example, the notification-type response flow.
- when the utterance score "Score_uttr" is equal to or less than the utterance presence / absence threshold value "Threshold_uttr", the server device 100 classifies the inspiration of the user U1 as inspiration after which no utterance is expected (also referred to as "completely non-request-type inspiration"). In this case, the information processing system 1 does not activate the voice UI (voice recognition).
- the server device 100 classifies the inspiration of the user U1 by comparing the utterance score "Score_uttr" with the utterance presence / absence threshold value "Threshold_uttr" and the request-type utterance threshold value "Threshold_ask". Each threshold value, such as the utterance presence / absence threshold value "Threshold_uttr" and the request-type utterance threshold value "Threshold_ask", may be increased or decreased as the user's normal breathing range changes with the user's exercise state.
- the utterance score "Score_uttr” is a value that takes into account the maximum respiration volume, the amount of increase, and the time between the maximum respiration and the utterance (maximum inspiratory-speech time).
- the server device 100 classifies the user's respiration by using the utterance score "Score_uttr". This allows the server device 100 to appropriately classify the user's intake air and use it for processing selection.
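The two-threshold classification of FIG. 13 can be sketched as follows. The enum, the function name, and the action strings are hypothetical. The handling of scores exactly equal to a threshold is an assumption that follows the "equal to or less than" wording used for the utterance-length thresholds later in the text.

```python
from enum import Enum


class InspirationClass(Enum):
    REQUEST = "request-type"
    INTERMEDIATE = "intermediate value"
    NO_UTTERANCE = "no utterance expected"


# Processing selected for each class, paraphrased from the description.
ACTIONS = {
    InspirationClass.REQUEST: "activate voice UI with explicit notification",
    InspirationClass.INTERMEDIATE: "activate voice recognition without notification",
    InspirationClass.NO_UTTERANCE: "do not activate voice UI",
}


def classify_inspiration(
    score_uttr: float, threshold_uttr: float, threshold_ask: float
) -> InspirationClass:
    """Compare Score_uttr with Threshold_uttr < Threshold_ask."""
    if threshold_uttr >= threshold_ask:
        raise ValueError("threshold_uttr must be smaller than threshold_ask")
    if score_uttr > threshold_ask:
        return InspirationClass.REQUEST
    if score_uttr > threshold_uttr:
        return InspirationClass.INTERMEDIATE
    return InspirationClass.NO_UTTERANCE
```

A caller would look up `ACTIONS[classify_inspiration(score, t_uttr, t_ask)]` to pick the flow, and could pass adjusted thresholds when the user's exercise state changes.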
- when the inspiration of the user U1 is classified as request-type inspiration in step S301, the information processing system 1 performs the activation process and causes the user U1 to recognize the activation (step S311).
- the terminal device 10 causes the user U1 to recognize the activation by turning on the light source unit 18.
- the user U1 can recognize that the input by voice has become possible.
- the user U1 speaks (step S312). For example, the user U1 performs voice input requesting predetermined information from the terminal device 10.
- the information processing system 1 performs conventional processing (processing of the voice dialogue system) on the input by the user U1 (step S313).
- the information processing system 1 interprets a user's input and executes a corresponding process (Action) by natural language understanding (NLU).
- the terminal device 10 processes the voice output in response to the request of the user U1 as "OK. Here's the result .
- when the inspiration of the user U1 is classified as an intermediate value in step S301, the information processing system 1 starts voice recognition without a response (step S321).
- the user U1 speaks (step S322). For example, the user U1 performs voice input requesting predetermined information from the terminal device 10.
- the information processing system 1 acquires the Intent (intention) by natural language understanding (NLU) without a response (step S323).
- the information processing system 1 performs a notification type response (step S324).
- the terminal device 10 processes the voice output for the user U1 as "I have an idea for your ".
- the information processing system 1 may determine whether or not the notification is possible by determining whether the user is continuing the same topic or the conversation is continuing.
- the information processing system 1 does not start the activation process (step S331).
- the information processing system 1 does not activate voice recognition.
- the terminal device 10 does not activate voice recognition.
- the information processing system 1 can perform appropriate processing according to the pre-utterance breathing state by selecting the processing using the classification result of the user's inspiration. For example, the information processing system 1 can make it possible to omit the activation word depending on the pre-utterance breathing state by using the classification result of the user's inspiration.
- FIG. 14 is a diagram showing an example of processing using the intake classification result.
- FIG. 14 shows an example of switching between local / cloud voice recognition.
- the server device 100 acquires the intake information indicating the intake before the utterance of the user U1.
- the server device 100 acquires intake information indicating the intake of the user U1 from the terminal device 10 used by the user U1.
- the server device 100 performs the classification process using the intake information indicating the intake of the user U1 (step S401).
- the server device 100 calculates the score using the intake information.
- the server device 100 classifies the intake air of the user U1 by comparing the calculated score with the threshold value.
- the server device 100 classifies the intake air of the user U1 based on the magnitude relationship between the calculated score and each threshold value.
- FIG. 15 is a diagram showing an example of user's intake information.
- FIG. 16 is a diagram showing an example of prediction using the user's intake air.
- the graph GR3 in FIG. 15 is a graph showing the relationship between time and the amount of intake air, and the horizontal axis shows time and the vertical axis shows the amount of intake air.
- the same points as the graph GR1 in FIG. 2 and the graph GR2 in FIG. 12 will not be described with respect to the graph GR3.
- the maximum respiratory volume "B_max” in the graph GR3 indicates the maximum inspiratory volume (maximum inspiratory volume) reached by inspiration before utterance.
- the maximum inspiratory-speech time "T_bmax_uttr” indicates the interval from the time when the maximum inspiratory amount is reached to the time when the utterance is started (speech start time).
- the increase amount "B_increase” in the graph GR3 indicates a change (increase amount) in the inspiratory volume before reaching the maximum respiratory volume "B_max".
- the increase amount "B_increase” may be a change (increase amount) in the intake amount at the time of acquisition (current time) of the intake information.
- the inspiratory information includes the increased amount "B_increase” in FIG. 15, the maximum respiration amount “B_max”, and the maximum inspiratory-speech time "T_bmax_uttr".
- the inspiratory information may not include the maximum inspiratory-utterance time "T_bmax_uttr”.
- the score may be calculated by setting "c * (1 / T_bmax_uttr)", which is the term (third term) related to the maximum inspiratory-utterance time "T_bmax_uttr", to "0".
- the maximum respiratory volume "B_max” may not be included.
- the server device 100 may predict the maximum respiration volume “B_max” as described with reference to FIG. 2 and calculate the score using the predicted maximum respiration volume “B_max”.
- the server device 100 uses the increase amount "B_increase", the maximum respiration volume "B_max", the maximum inspiratory-speech time "T_bmax_uttr", and equation (3) to calculate the utterance length assumed score "Score_uttr_length", which is a score used to predict the utterance length.
- equation (3) has the same form as equation (2), but the values of the coefficients "a", "b", and "c" are different. Because the utterance length is considered to be closely related to the maximum inspiratory volume, the coefficient "a" is relatively large compared with that of equation (2).
- the above equation (3) is an example of the calculation of the utterance length assumed score "Score_uttr_length”, and various mathematical formulas may be used for the calculation of the utterance length assumed score "Score_uttr_length”.
- the server device 100 classifies the inspiration of the user U1 using two thresholds, namely the short sentence utterance threshold value "Threshold_uttr_short" and the long sentence utterance threshold value "Threshold_uttr_long".
- the server device 100 classifies the intake air of the user U1 by comparing the utterance length assumed score “Score_uttr_length” with each threshold value. That is, the server device 100 classifies the utterance type according to the value of the utterance length assumed score "Score_uttr_length”.
- the server device 100 classifies the inspiration of the user U1 by comparing the utterance length assumed score "Score_uttr_length" with the short sentence utterance threshold value "Threshold_uttr_short" and the long sentence utterance threshold value "Threshold_uttr_long". In the example of FIG. 16, the short sentence utterance threshold value "Threshold_uttr_short" is smaller than the long sentence utterance threshold value "Threshold_uttr_long".
- when the utterance length assumed score "Score_uttr_length" is larger than the long sentence utterance threshold value "Threshold_uttr_long", the server device 100 classifies the inspiration of the user U1 as inspiration with a high possibility of a long sentence utterance (also referred to as "long sentence type inspiration").
- the information processing system 1 prepares to activate cloud voice recognition and processes the utterance. For example, the information processing system 1 activates the voice recognition of the server device 100, and the server device 100 processes the utterance.
- when the utterance length assumed score "Score_uttr_length" is equal to or less than the long sentence utterance threshold value "Threshold_uttr_long" and larger than the short sentence utterance threshold value "Threshold_uttr_short", the server device 100 classifies the inspiration of the user U1 as inspiration for which it is difficult to clearly estimate whether the utterance will be long or short (also referred to as "medium sentence type inspiration").
- the information processing system 1 prepares, for example, both a cloud type and a local type.
- the information processing system 1 activates voice recognition of the server device 100 and the terminal device 10.
- the information processing system 1 uses local recognition at the initial stage of utterance and cloud recognition result as needed.
- when the utterance length assumed score "Score_uttr_length" is equal to or less than the short sentence utterance threshold value "Threshold_uttr_short", the server device 100 classifies the inspiration of the user U1 as inspiration with a high possibility of a short sentence utterance (also referred to as "short sentence type inspiration").
- the information processing system 1 prepares to activate the local voice recognition and processes the utterance. For example, the information processing system 1 activates the voice recognition of the terminal device 10, and the terminal device 10 processes the utterance.
- the server device 100 classifies the inspiration of the user U1 by comparing the utterance length assumed score "Score_uttr_length" with the short sentence utterance threshold value "Threshold_uttr_short" and the long sentence utterance threshold value "Threshold_uttr_long".
- each threshold value such as the short sentence utterance threshold value "Threshold_uttr_short” and the long sentence utterance threshold value "Threshold_uttr_long” may be increased or decreased according to the change of the normal breathing range according to the change of the exercise state of the user.
- the utterance length assumed score "Score_uttr_length" is a value that takes into account the maximum respiration volume, the amount of increase, and the time from the maximum inspiration to the utterance (maximum inspiratory-speech time).
- the server device 100 classifies the user's respiration by using the utterance length assumed score “Score_uttr_length”. This allows the server device 100 to appropriately classify the user's intake air and use it for processing selection.
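The mapping from the utterance length assumed score to the recognizer or recognizers to prepare (FIG. 14 and FIG. 16) can be sketched as below. The function name and the labels "cloud" and "local" are illustrative stand-ins for voice recognition on the server device 100 and the terminal device 10, respectively.

```python
from typing import Tuple


def select_recognizers(
    score_uttr_length: float, threshold_short: float, threshold_long: float
) -> Tuple[str, ...]:
    """Return which recognizers to prepare for the predicted utterance length."""
    if threshold_short >= threshold_long:
        raise ValueError("threshold_short must be smaller than threshold_long")
    if score_uttr_length > threshold_long:
        # Long sentence expected: cloud (server-type) recognition.
        return ("cloud",)
    if score_uttr_length > threshold_short:
        # Length hard to guess: prepare both local and cloud recognition.
        return ("local", "cloud")
    # Short sentence expected: local recognition on the terminal.
    return ("local",)
```

Returning a tuple keeps the intermediate case explicit, so the caller can start both recognizers and decide later which result to use.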
- when the inspiration of the user U1 is classified as long sentence type inspiration in step S401, the information processing system 1 starts cloud CL (server-type) large vocabulary speech recognition (step S411).
- the information processing system 1 connects the server device 100 and the terminal device 10 by WebSocket or the like.
- the information processing system 1 performs processing for the utterance of the user U1 by using the cloud (server type) large vocabulary speech recognition (step S413).
- the information processing system 1 uses cloud (server type) large vocabulary speech recognition to process a user's utterance. As a result, the information processing system 1 can improve the long sentence performance by the large vocabulary speech recognition.
- the information processing system 1 prepares both the cloud CL and the local (step S421). For example, the information processing system 1 activates voice recognition of the server device 100 and the terminal device 10.
- the information processing system 1 responds with the highly responsive local result as the initial response during the utterance (step S423).
- the information processing system 1 makes an initial response by voice recognition of the terminal device 10.
- the information processing system 1 changes to the cloud CL result when the utterance length exceeds a certain level (step S424). For example, when the utterance exceeds a certain length, the information processing system 1 changes to a response based on voice recognition by the server device 100. In this way, the information processing system 1 processes the initial response locally and, in the case of a long sentence, processes the utterance on the cloud CL side, which has a large amount of backup data.
- the information processing system 1 performs the activation process of the local terminal voice recognition (step S431). In the example of FIG. 14, the information processing system 1 performs a process of activating the voice recognition of the terminal device 10.
- the information processing system 1 processes the utterance by voice recognition of the terminal device 10 (step S433).
- in this case, the information processing system 1 can respond quickly and does not require data communication.
- the information processing system 1 can perform appropriate processing according to the pre-utterance breathing state by selecting the processing using the classification result of the user's inspiration. For example, the information processing system 1 can switch (select) between local and cloud speech recognition using the classification result of the user's inspiration. As a result, the information processing system 1 can weigh the fast response of local voice recognition and the need for data communication against the long sentence recognition performance of cloud voice recognition, and use each appropriately according to the conditions.
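For the branch that prepares both cloud and local recognition (steps S421 to S424), the switch from the local result to the cloud result can be sketched as below. The function name, the use of a character count as the utterance length, and the fallback to the local result when no cloud result has arrived are assumptions made for illustration.

```python
from typing import Optional


def choose_response(
    local_result: str,
    cloud_result: Optional[str],
    utterance_length: int,
    switch_length: int,
) -> str:
    """Pick the recognition result to respond with during an utterance."""
    # S424: once the utterance exceeds a certain length, use the cloud result.
    # Falling back to the local result when the cloud result is missing
    # is an assumption, not something stated in the description.
    if utterance_length > switch_length and cloud_result is not None:
        return cloud_result
    # S423: otherwise respond with the local result as the initial response.
    return local_result
```

Calling this repeatedly as partial results arrive would give a fast local response first and the cloud result later for long utterances.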
- FIG. 17 is a diagram showing an example of processing using the intake classification result.
- FIG. 17 shows an example of changing the voice recognition dictionary.
- the server device 100 acquires the intake information indicating the intake before the utterance of the user U1.
- the server device 100 acquires intake information indicating the intake of the user U1 from the terminal device 10 used by the user U1.
- the server device 100 performs the classification process using the intake information indicating the intake of the user U1 (step S501).
- the server device 100 calculates the score using the intake information.
- the server device 100 classifies the intake air of the user U1 by comparing the calculated score with the threshold value.
- the server device 100 classifies the intake air of the user U1 based on the magnitude relationship between the calculated score and each threshold value.
- FIG. 18 is a diagram showing an example of user's intake information.
- FIG. 19 is a diagram showing an example of prediction using the user's intake air.
- the graph GR4 in FIG. 18 is a graph showing the relationship between time and the amount of intake air, and the horizontal axis shows time and the vertical axis shows the amount of intake air.
- the same points as the graph GR1 in FIG. 2 and the graph GR2 in FIG. 12 will not be described with respect to the graph GR4.
- the maximum respiratory volume "B_max” in the graph GR4 indicates the maximum inspiratory volume (maximum inspiratory volume) reached by inspiration before utterance.
- the maximum inspiratory-speech time "T_bmax_uttr” indicates the interval from the time when the maximum inspiratory amount is reached to the time when the utterance is started (speech start time).
- the increase amount "B_increase” in the graph GR4 indicates the change (increase amount) of the inspiratory volume before reaching the maximum respiratory volume "B_max".
- the increase amount "B_increase” may be a change (increase amount) in the intake amount at the time of acquisition (current time) of the intake information.
- the inspiratory information includes the increased amount "B_increase” in FIG. 18, the maximum respiration amount “B_max”, and the maximum inspiratory-speech time "T_bmax_uttr".
- the inspiratory information may not include the maximum inspiratory-utterance time "T_bmax_uttr”.
- the score may be calculated by setting "c * (1 / T_bmax_uttr)", which is the term (third term) related to the maximum inspiratory-utterance time "T_bmax_uttr", to "0".
- the maximum respiratory volume "B_max” may not be included.
- the server device 100 may predict the maximum respiration volume “B_max” as described with reference to FIG. 2 and calculate the score using the predicted maximum respiration volume “B_max”.
- the server device 100 uses the increase amount "B_increase", the maximum respiration volume "B_max", the maximum inspiratory-speech time "T_bmax_uttr", and equation (4) to calculate the utterance length assumed score "Score_uttr_length", which is a score used to predict the utterance length.
- in equation (4), "a", "b", and "c" indicate predetermined constants.
- equation (4) has the same form as equation (2), but the values of the coefficients "a", "b", and "c" are different. Because the utterance length is considered to be closely related to the maximum inspiratory volume, the coefficient "a" is relatively large compared with that of equation (2).
- the above equation (4) is an example of the calculation of the utterance length assumed score "Score_uttr_length”, and various mathematical formulas may be used for the calculation of the utterance length assumed score "Score_uttr_length”.
- the server device 100 classifies the inspiration of the user U1 using two thresholds, namely the short sentence utterance threshold value "Threshold_uttr_short" and the long sentence utterance threshold value "Threshold_uttr_long".
- the server device 100 classifies the intake air of the user U1 by comparing the utterance length assumed score “Score_uttr_length” with each threshold value. That is, the server device 100 classifies the utterance type according to the value of the utterance length assumed score "Score_uttr_length”.
- the server device 100 classifies the inspiration of the user U1 by comparing the utterance length assumed score "Score_uttr_length" with the short sentence utterance threshold value "Threshold_uttr_short" and the long sentence utterance threshold value "Threshold_uttr_long". In the example of FIG. 19, the short sentence utterance threshold value "Threshold_uttr_short" is smaller than the long sentence utterance threshold value "Threshold_uttr_long".
- when the utterance length assumed score "Score_uttr_length" is larger than the long sentence utterance threshold value "Threshold_uttr_long", the server device 100 classifies the inspiration of the user U1 as inspiration with a high possibility of a long sentence utterance (also referred to as "long sentence type inspiration").
- the information processing system 1 prepares a long sentence type speech recognition dictionary. For example, the server device 100 acquires the information of the long sentence dictionary among the dictionaries stored in the storage unit 120, and performs the voice recognition process using the acquired information.
- when the utterance length assumed score "Score_uttr_length" is equal to or less than the long sentence utterance threshold value "Threshold_uttr_long" and larger than the short sentence utterance threshold value "Threshold_uttr_short", the server device 100 classifies the inspiration of the user U1 as inspiration for which it is difficult to clearly estimate whether the utterance will be long or short (also referred to as "medium sentence type inspiration").
- the information processing system 1 prepares a medium sentence type speech recognition dictionary. For example, the server device 100 acquires the information of the medium sentence dictionary among the dictionaries stored in the storage unit 120, and performs the voice recognition process using the acquired information.
- when the utterance length assumed score "Score_uttr_length" is equal to or less than the short sentence utterance threshold value "Threshold_uttr_short", the server device 100 classifies the inspiration of the user U1 as inspiration with a high possibility of a short sentence utterance (also referred to as "short sentence type inspiration").
- the information processing system 1 prepares a short sentence type speech recognition dictionary (word / phrase). For example, the server device 100 acquires the information of the word / phrase dictionary from the dictionaries stored in the storage unit 120, and performs the voice recognition process using the acquired information.
- the server device 100 classifies the inspiration of the user U1 by comparing the utterance length assumed score "Score_uttr_length" with the short sentence utterance threshold value "Threshold_uttr_short" and the long sentence utterance threshold value "Threshold_uttr_long".
- each threshold value such as the short sentence utterance threshold value "Threshold_uttr_short” and the long sentence utterance threshold value "Threshold_uttr_long” may be increased or decreased according to the change of the normal breathing range according to the change of the exercise state of the user.
- the utterance length assumed score "Score_uttr_length" is a value that takes into account the maximum respiration volume, the amount of increase, and the time from the maximum inspiration to the utterance (maximum inspiratory-speech time).
- the server device 100 classifies the user's respiration by using the utterance length assumed score “Score_uttr_length”. This allows the server device 100 to appropriately classify the user's intake air and use it for processing selection.
- the information processing system 1 selects the long sentence dictionary (step S511). For example, the server device 100 selects the information of the long sentence dictionary from the dictionaries stored in the storage unit 120.
- the information processing system 1 acquires the voice recognition result using the selected dictionary (step S542). For example, the information processing system 1 acquires a voice recognition result using a long-sentence dictionary.
- the medium sentence dictionary is selected (step S521).
- the server device 100 selects the information of the medium sentence dictionary from the dictionaries stored in the storage unit 120.
- the information processing system 1 acquires the voice recognition result using the selected dictionary (step S542). For example, the information processing system 1 acquires a voice recognition result using the medium sentence dictionary.
- a dictionary for words / phrases is selected (step S521).
- the server device 100 selects the information of the word / phrase dictionary for short sentences from the dictionaries stored in the storage unit 120.
- the information processing system 1 acquires the voice recognition result using the selected dictionary (step S542). For example, the information processing system 1 acquires a voice recognition result using a word / phrase dictionary.
- the information processing system 1 can improve the voice recognition performance by changing the dictionary used for voice recognition according to the utterance length.
- the accuracy may decrease when uttering one word or word by word.
- the recognition performance may be significantly reduced.
- the information processing system 1 estimates whether the utterance is short or long from the respiratory state and changes the speech recognition engine dictionary. As described above, the information processing system 1 can suppress the above-mentioned deterioration in performance by selecting the dictionary according to the classification of the intake air.
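The dictionary selection described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function names, the dictionary labels, and the assumption that a larger "Score_uttr_length" means a longer expected utterance are all hypothetical.

```python
from enum import Enum


class InspirationType(Enum):
    LONG = "long"
    MEDIUM = "medium"
    SHORT = "short"


# Hypothetical labels standing in for the dictionaries held in the storage unit 120.
DICTIONARIES = {
    InspirationType.LONG: "long_sentence_dictionary",
    InspirationType.MEDIUM: "medium_sentence_dictionary",
    InspirationType.SHORT: "word_phrase_dictionary",
}


def classify_utterance_length(score: float, thr_short: float, thr_long: float) -> InspirationType:
    """Classify an inspiration from Score_uttr_length and two thresholds.

    Assumption: a score above Threshold_uttr_long means a long utterance,
    a score below Threshold_uttr_short means a short one, anything else is medium.
    """
    if thr_short >= thr_long:
        raise ValueError("thr_short must be smaller than thr_long")
    if score > thr_long:
        return InspirationType.LONG
    if score < thr_short:
        return InspirationType.SHORT
    return InspirationType.MEDIUM


def select_dictionary(score: float, thr_short: float, thr_long: float) -> str:
    """Return the label of the speech recognition dictionary to use."""
    return DICTIONARIES[classify_utterance_length(score, thr_short, thr_long)]
```

For example, with thresholds 0.3 and 0.7, a score of 0.9 selects the long sentence dictionary and a score of 0.1 selects the word / phrase dictionary.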
- The information processing system 1 is not limited to the above-mentioned example, and may select various information and processing based on the classification into types ranging from long sentence type inspiration to short sentence type inspiration. This point will be described with reference to FIG. 20. FIG. 20 is a diagram showing an example of processing using the inspiration classification result.
- FIG. 20 shows a case where the UI element to be selected is changed according to the utterance length estimated from the inspiratory state.
- the information processing system 1 selects a suitable UI element according to the expected utterance amount even if the UI element is randomly laid out.
- the content CT1 in FIG. 20 is displayed, for example, on the display unit 16 (screen) of the terminal device 10.
- the element EL1 corresponding to the ID, the element EL2 corresponding to the Title, and the element EL3 corresponding to the MessageBody (text) are randomly arranged.
- the ID is assumed to be a short input such as a number.
- the Title is assumed to be a medium-length input of about several words.
- the MessageBody is assumed to be a long input such as free text. Therefore, as shown in FIG. 20, the area occupied by the element EL1, the element EL2, and the element EL3 increases in this order.
- the server device 100 performs the classification process using the intake information indicating the intake of the user U1 (step S601). Since step S601 is the same as step S501, the description thereof will be omitted.
- the information processing system 1 selects the element EL3 corresponding to the MessageBody from the elements EL1 to EL3 in the content CT1 (step S611). For example, the terminal device 10 selects the element EL3 corresponding to the MessageBody as an input target.
- the information processing system 1 selects the element EL2 corresponding to the title from the elements EL1 to EL3 in the content CT1 (step S621). For example, the terminal device 10 selects the element EL2 corresponding to the Title as an input target.
- the information processing system 1 selects the element EL1 corresponding to the ID from the elements EL1 to EL3 in the content CT1 (step S631). For example, the terminal device 10 selects the element EL1 corresponding to the ID as an input target.
- the information processing system 1 may determine an element by appropriately using various information.
- The information processing system 1 automatically determines the input destination when the UI element is uniquely determined. When there are multiple UI elements expected to receive input of the same length, or when the system otherwise cannot determine the input destination automatically, the input element may be determined by a process such as asking the user.
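The element selection for the content CT1, including the fallback of asking the user, can be sketched as follows. The mapping of elements to expected input lengths and the function name are hypothetical; returning `None` is an assumed way to signal that the caller should ask the user.

```python
from typing import Dict, Optional

# Hypothetical description of the content CT1: element ID -> expected input length.
CONTENT_CT1: Dict[str, str] = {
    "EL1": "short",   # ID
    "EL2": "medium",  # Title
    "EL3": "long",    # MessageBody
}


def select_input_element(inspiration_type: str, elements: Dict[str, str]) -> Optional[str]:
    """Pick the UI element whose expected input length matches the classification.

    Returns the element ID when exactly one element matches. Returns None when
    zero or several elements match, in which case the caller is assumed to
    resolve the input destination some other way, such as asking the user.
    """
    candidates = [eid for eid, length in elements.items() if length == inspiration_type]
    if len(candidates) == 1:
        return candidates[0]
    return None
```

With the layout above, a long sentence type inspiration selects EL3. If a second long-input element were added, the function would return `None` and the input destination would have to be resolved by other means.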
- FIGS. 21 and 22 are diagrams showing an example of processing using the intake classification result.
- FIG. 21 shows an example of changing the system response (Text-To-Speech).
- FIG. 21 shows an example of a response change at the time of the WUW (Wake Up Word).
- the server device 100 acquires the intake information indicating the intake before the utterance of the user U1.
- the server device 100 acquires intake information indicating the intake of the user U1 from the terminal device 10 used by the user U1.
- Inhalation before the utterance of the user U1 is performed (step S701), and then the user U1 utters the WUW (step S702).
- the server device 100 performs the classification process using the intake information of the user U1.
- the server device 100 calculates the score using the intake information.
- the server device 100 classifies the intake air of the user U1 by comparing the calculated score with the threshold value.
- the server device 100 classifies the intake air of the user U1 based on the magnitude relationship between the calculated score and each threshold value.
- FIG. 23 is a diagram showing an example of user's intake information.
- FIG. 24 is a diagram showing an example of prediction using the user's intake air.
- the graph GR5 in FIG. 23 is a graph showing the relationship between time and the amount of intake air, and the horizontal axis shows time and the vertical axis shows the amount of intake air.
- the same points as the graph GR1 in FIG. 2 and the graph GR2 in FIG. 12 will not be described with respect to the graph GR5.
- the maximum respiratory volume "B_max” in the graph GR5 indicates the maximum inspiratory volume (maximum inspiratory volume) reached by inspiration before utterance.
- the maximum inspiratory-speech time "T_bmax_uttr” indicates the interval from the time when the maximum inspiratory amount is reached to the time when the utterance is started (speech start time).
- the increase amount "B_increase” in the graph GR5 indicates a change (increase amount) in the inspiratory volume before reaching the maximum respiratory volume "B_max".
- the increase amount "B_increase” may be a change (increase amount) in the intake amount at the time of acquisition (current time) of the intake information.
- the inspiratory information includes the increased amount "B_increase” in FIG. 23, the maximum respiration amount “B_max”, and the maximum inspiratory-speech time "T_bmax_uttr".
- the inspiratory information may not include the maximum inspiratory-utterance time "T_bmax_uttr”.
- the score may be calculated by setting "c * (1 / T_bmax_uttr)", which is the term (third term) related to the maximum inspiratory-utterance time "T_bmax_uttr", to "0".
- the maximum respiratory volume "B_max” may not be included.
- the server device 100 may predict the maximum respiration volume “B_max” as described with reference to FIG. 2 and calculate the score using the predicted maximum respiration volume “B_max”.
- The server device 100 calculates the utterance rush score "Score_hurry", which is a score used for utterance prediction, using the increase amount "B_increase", the maximum respiration volume "B_max", the maximum inspiratory-speech time "T_bmax_uttr", and the following formula (5).
- Formula (5) has the same form as the above formulas (2) to (4), but the values of "a", "b", and "c" are different. Because a sharp rise in inspiration is to be strongly reflected, the coefficient "b" is relatively large compared with the above formulas (2) to (4).
- the above equation (5) is an example of the calculation of the utterance rush score "Score_hurry”, and various mathematical formulas may be used for the calculation of the utterance rush score "Score_hurry”.
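Formula (5) itself is not reproduced in this text. Based only on the description here (the third term is "c * (1 / T_bmax_uttr)", and the coefficient "b" weights the sharp rise in inspiration), one plausible form is the following. The pairing of "a" with "B_max" and "b" with "B_increase" is an inference from that description, not a quotation of the original formula.

```latex
\mathrm{Score\_hurry} = a \cdot B_{\mathrm{max}} + b \cdot B_{\mathrm{increase}} + c \cdot \frac{1}{T_{\mathrm{bmax\_uttr}}}
```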
- The server device 100 classifies the inspiration of the user U1 using two thresholds, that is, a hurry low threshold "Threshold_hurry_low" and a hurry high threshold "Threshold_hurry_high".
- The server device 100 classifies the inspiration of the user U1 by comparing the utterance rush score "Score_hurry" with each threshold. That is, the server device 100 classifies the utterance type according to the value of the utterance rush score "Score_hurry".
- The server device 100 classifies the inspiration of the user U1 by comparing the utterance rush score "Score_hurry" with the hurry low threshold "Threshold_hurry_low" and the hurry high threshold "Threshold_hurry_high".
- The hurry low threshold "Threshold_hurry_low" is smaller than the hurry high threshold "Threshold_hurry_high".
- When the utterance rush score "Score_hurry" is larger than the hurry high threshold "Threshold_hurry_high", the server device 100 classifies the inspiration as a "shortest type inspiration".
- the information processing system 1 predicts that the user wants the shortest processing and executes the shortest processing. For example, the information processing system 1 shortens the TTS (Text-To-Speech) utterance and outputs an SE (sound effect) when the user is rushing to execute a task.
- When the utterance rush score "Score_hurry" is smaller than the hurry high threshold "Threshold_hurry_high" and larger than the hurry low threshold "Threshold_hurry_low", the server device 100 classifies the inspiration as one that cannot be clearly assigned to either extreme (also called "intermediate type inspiration"). In this case, the information processing system 1 predicts that the user wants an intermediate process between the shortest and the normal, and executes the intermediate process. For example, the information processing system 1 summarizes and presents TTS utterance sentences according to the value of the utterance rush score "Score_hurry". The details of the intermediate processing will be described later.
- When the utterance rush score "Score_hurry" is smaller than the hurry low threshold "Threshold_hurry_low", the server device 100 classifies the inspiration as a "normal type inspiration".
- The information processing system 1 predicts that the user desires normal processing and executes the normal processing. For example, because the user is not particularly in a hurry to speak, the information processing system 1 executes a TTS utterance that conveys the information to the user in the most detail.
- the server device 100 classifies the inspiration of the user U1 by comparing the utterance rush score "Score_hurry” with the rush low threshold “Threshold_hurry_low” and the hurry high threshold “Threshold_hurry_high”. It should be noted that each threshold value such as the hurry low threshold value “Threshold_hurry_low” and the hurry high threshold value “Threshold_hurry_high” may be increased or decreased according to the change in the normal breathing range according to the change in the exercise state of the user.
- the utterance rush score "Score_hurry” is a value that takes into account the maximum respiration volume, the amount of increase, and the time between the maximum respiration and the utterance (maximum inspiratory-speech time).
- the server device 100 classifies the user's breathing by using the utterance rush score "Score_hurry”. This allows the server device 100 to appropriately classify the user's intake air and use it for processing selection. It should be noted that the respiratory state and the speed of utterance may be combined for judgment, but details on this point will be described later.
- the information processing system 1 predicts that the user U1 desires the normal processing and selects the normal processing (step S711).
- the terminal device 10 outputs in a normal process such as "How can I help?". Then, the user U1 speaks (step S731).
- the information processing system 1 predicts that the user U1 desires the shortest processing and selects the shortest processing (step S721).
- the terminal device 10 outputs only a predetermined SE (sound effect). Then, the user U1 speaks (step S731).
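The scoring, three-way classification, and response selection at the time of the WUW can be sketched as follows. The score form, the coefficient values, the function names, and the action tuples are assumptions made for illustration. The only details taken from the description are that the third term is "c * (1 / T_bmax_uttr)", that it is set to 0 when that time is unavailable, and that "b" is comparatively large.

```python
from typing import Optional, Tuple


def score_hurry(b_max: float, b_increase: float, t_bmax_uttr: Optional[float] = None,
                a: float = 1.0, b: float = 2.0, c: float = 0.5) -> float:
    """Assumed form of Score_hurry: a*B_max + b*B_increase + c*(1/T_bmax_uttr).

    The coefficient values are placeholders. When T_bmax_uttr is missing
    (or not positive), the third term is treated as 0.
    """
    third = c / t_bmax_uttr if t_bmax_uttr is not None and t_bmax_uttr > 0 else 0.0
    return a * b_max + b * b_increase + third


def classify_hurry(score: float, thr_low: float, thr_high: float) -> str:
    """Return 'shortest', 'intermediate', or 'normal' from the two hurry thresholds."""
    if score > thr_high:
        return "shortest"
    if score < thr_low:
        return "normal"
    return "intermediate"


def wake_word_response(kind: str, normal_text: str = "How can I help?") -> Tuple[str, Optional[str]]:
    """Map the classification to a hypothetical (action, text) pair."""
    if kind == "shortest":
        return ("sound_effect", None)      # output only a predetermined SE
    if kind == "intermediate":
        return ("tts_summary", normal_text)  # text to be summarized before output
    return ("tts", normal_text)            # normal, full TTS response
```

For example, `wake_word_response(classify_hurry(5.0, 1.0, 3.0))` yields the sound-effect action, while a score below the low threshold yields the full "How can I help?" response.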
- When the inspiration of the user U1 is classified as an intermediate type inspiration, it is predicted that the user desires an intermediate processing between the shortest and the normal, and the intermediate processing is executed. This point will be described below.
- When the inspiration of the user U1 is classified as an intermediate type inspiration, the information processing system 1 summarizes and presents the TTS utterance sentence according to the value of the utterance rush score "Score_hurry". For example, the information processing system 1 summarizes TTS utterances using the value of the utterance rush score "Score_hurry".
- the information processing system 1 may calculate the utterance rush score "Score_hurry", which is the score used for utterance prediction, by using the following equation (6) instead of the above equation (5).
- V_uttr is an index of how many characters are spoken per unit time (the number of uttered characters per unit time), and is calculated using, for example, the following formula (7).
- FIG. 25 is a diagram showing an example of the relationship between the length of the user's utterance and the number of characters.
- the utterance UT in FIG. 25 conceptually indicates the utterance by the user.
- FIG. 25 shows that the utterance UT was performed from the start time "T_uttr_start" to the end time "T_uttr_end". That is, "T_uttr_end - T_uttr_start" in the formula (7), which is the value obtained by subtracting the start time "T_uttr_start" from the end time "T_uttr_end", indicates the length of the utterance.
- "Character number of the utterance" in the formula (7) indicates the number of characters included in the utterance UT.
- V_uttr is an index of how many characters are spoken per unit time in the utterance UT. For example, a large "V_uttr" indicates that the utterance is fast, and a small "V_uttr" indicates that the utterance is slow.
- the information processing system 1 may add the utterance speed to the calculation of the utterance rush score “Score_hurry”.
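The speech-rate index can be sketched as follows. The first function follows the description of formula (7): the number of characters divided by "T_uttr_end - T_uttr_start". The second function is only an assumption about how formula (6) might fold "V_uttr" into the score, since that formula is not reproduced in this text; the extra coefficient "d" and all coefficient values are hypothetical.

```python
def speech_rate(num_chars: int, t_uttr_start: float, t_uttr_end: float) -> float:
    """V_uttr: characters spoken per unit time over the utterance."""
    duration = t_uttr_end - t_uttr_start
    if duration <= 0:
        raise ValueError("t_uttr_end must be later than t_uttr_start")
    return num_chars / duration


def score_hurry_with_rate(b_max: float, b_increase: float, t_bmax_uttr: float, v_uttr: float,
                          a: float = 1.0, b: float = 2.0, c: float = 0.5, d: float = 0.1) -> float:
    """Assumed extension of the rush score with an additive speech-rate term d*V_uttr."""
    if t_bmax_uttr <= 0:
        raise ValueError("t_bmax_uttr must be positive")
    return a * b_max + b * b_increase + c / t_bmax_uttr + d * v_uttr
```

For instance, 20 characters spoken between t = 1.0 and t = 5.0 give a rate of 5.0 characters per unit time, and a faster utterance raises the assumed score.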
- the information processing system 1 summarizes TTS utterances using the utterance rush score "Score_hurry" calculated using either the above formula (5) or the above formula (6).
- the information processing system 1 may summarize sentences, or may use an API (Application Programming Interface) or the like provided by an external service.
- the information processing system 1 may calculate the shortening target value using the following formula (8).
- the information processing system 1 summarizes the TTS utterance based on the value of "Abbrev_target". For example, the information processing system 1 summarizes TTS utterances using the following equation (9).
- Shorten_API indicates a predetermined function (API) used for summarization generation.
- original_response indicates the TTS response before summarization.
- Response_abbrev in the above formula (9) shows a summary of TTS utterances output by Shorten_API. In this case, the information processing system 1 uses the "Response_abbrev” output by Shorten_API as the TTS summary.
- When the inspiration of the user U1 is classified as an intermediate type inspiration, the information processing system 1 outputs the "Response_abbrev" produced by Shorten_API.
- the terminal device 10 outputs the TTS summary corresponding to "Response_abbrev”.
- The information processing system 1 estimates, from the inspiratory state before the utterance, how quickly the user wants to take turns, and adjusts the TTS response length accordingly. Further, when the user wants to complete the task in a hurry, the information processing system 1 switches to a short TTS response or an SE response and thereby shortens the task completion time. As a result, the information processing system 1 can improve usability. Note that some sentences cannot be summarized to the expected length. In that case, the information processing system 1 may adjust the playback speed of the TTS to shorten the time.
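The intermediate processing can be sketched as follows. Formula (8) is not reproduced in this text, so the linear mapping from "Score_hurry" to a target length is purely an assumption. `shorten_api` is a trivial stand-in for the Shorten_API summarizer (it keeps leading whole words), and the playback-rate fallback and its cap are likewise hypothetical.

```python
def abbrev_target(score: float, thr_low: float, thr_high: float,
                  original_length: int, min_ratio: float = 0.2) -> int:
    """Assumed Abbrev_target: shrink the target length linearly as the score rises."""
    frac = (score - thr_low) / (thr_high - thr_low)
    frac = min(max(frac, 0.0), 1.0)           # clamp to the intermediate band
    ratio = 1.0 - frac * (1.0 - min_ratio)    # 1.0 at thr_low, min_ratio at thr_high
    return max(1, round(original_length * ratio))


def shorten_api(original_response: str, target_chars: int) -> str:
    """Stand-in summarizer: keep leading whole words that fit within target_chars."""
    kept = []
    length = 0
    for word in original_response.split():
        extra = len(word) + (1 if kept else 0)  # account for the joining space
        if kept and length + extra > target_chars:
            break
        kept.append(word)
        length += extra
    return " ".join(kept)


def playback_rate(summary: str, target_chars: int, max_rate: float = 1.5) -> float:
    """Speed up TTS playback when the summary is still longer than the target."""
    return min(max_rate, max(1.0, len(summary) / target_chars))
```

A real system would replace `shorten_api` with an actual summarization function or external service. The point of the sketch is only the flow: compute a target, summarize toward it, and fall back to faster playback when the summary overshoots.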
- FIG. 22 shows an example of changing the system response (Text-To-Speech). Specifically, FIG. 22 shows an example of an action response change at the time of receiving an utterance.
- the server device 100 acquires the intake information indicating the intake before the utterance of the user U1.
- the server device 100 acquires intake information indicating the intake of the user U1 from the terminal device 10 used by the user U1.
- Inhalation before the utterance of the user U1 is performed (step S801), and then the user U1 utters the WUW (step S802).
- the server device 100 performs the classification process using the intake information of the user U1.
- the server device 100 calculates the score using the intake information.
- the server device 100 classifies the intake air of the user U1 by comparing the calculated score with the threshold value.
- the server device 100 classifies the intake air of the user U1 based on the magnitude relationship between the calculated score and each threshold value. Since the classification process is the same as in FIG. 21, the description thereof will be omitted.
- the information processing system 1 predicts that the user U1 desires the normal processing and selects the normal processing (step S811).
- the terminal device 10 displays information on the display DP (display unit 16) and outputs information in a normal process such as "OK, here's the result. One new movie and two music".
- the terminal device 10 displays the information for the user's request and also outputs the voice related to the information (TTS utterance).
- the information processing system 1 predicts that the user U1 desires the shortest processing and selects the shortest processing (step S821).
- the terminal device 10 displays information on the display DP (display unit 16) and outputs only a predetermined SE (sound effect).
- the terminal device 10 displays the information for the user's request and outputs only the notification sound to the user.
- When the inspiration of the user U1 is classified as an intermediate type inspiration, it is predicted that the user desires an intermediate processing between the shortest and the normal, and the intermediate processing as described above is executed.
- the terminal device 10 displays information for the user's request and also outputs a voice summarizing the TTS utterance.
- the shortest response is SE.
- the shortest response may be the minimum amount of TTS utterance in which the state can be known.
- The information processing system 1 estimates, from the inspiratory state before the utterance, how quickly the user wants to take turns. When the user is in a hurry, the information processing system 1 summarizes and shortens the TTS response after executing the action, or notifies the user by an SE. As a result, the information processing system 1 can improve usability.
- the terminal device 10 may perform the prediction process, the classification process, and the like. That is, the terminal device 10 which is a device on the client side may be an information processing device that performs the above-mentioned prediction processing and classification processing.
- The system configuration of the information processing system 1 is not limited to the configuration in which the server device 100, which is a device on the server side, performs the prediction processing and the classification processing. The terminal device 10, which is a device on the client side, may be configured to perform the prediction processing and the classification processing described above.
- the information processing system 1 performs utterance prediction and intake classification on the client side (terminal device 10). Then, the server side (server device 100) acquires the information of the prediction result and the classification result from the terminal device 10 and performs various processes.
- the terminal device 10 may have a prediction unit that realizes the same function as the prediction unit 132 described above, and a selection unit that realizes the same function as the selection unit 133. Further, in this case, the server device 100 does not have to have the prediction unit 132 and the selection unit 133.
- the information processing system 1 may have a system configuration in which the client side (terminal device 10) predicts the speech and the server side (server device 100) classifies the intake air.
- The terminal device 10, which is a device on the client side, may be an information processing device that performs the above-mentioned prediction processing, and the server device 100, which is a device on the server side, may be an information processing device that performs the above-mentioned classification processing.
- In this case, the prediction unit of the terminal device 10 performs the prediction process, and the prediction unit 132 of the server device 100 performs the classification process.
- the information processing system 1 may have a system configuration in which either the client-side device (terminal device 10) or the server-side device (server device 100) performs each process.
- Each component of each device shown in the figures is a functional concept, and does not necessarily have to be physically configured as shown in the figures. That is, the specific form of distribution / integration of each device is not limited to the one shown in the figures, and all or part of each device can be functionally or physically distributed / integrated in arbitrary units according to various loads and usage conditions.
- the information processing device (server device 100 in the embodiment) according to the present disclosure includes an acquisition unit (acquisition unit 131 in the embodiment) and a prediction unit (prediction unit 132 in the embodiment).
- the acquisition unit acquires inspiration information indicating the inspiration of the user.
- the prediction unit predicts whether or not the user speaks after the user's inspiration based on the inspiration information acquired by the acquisition unit.
- the information processing device predicts whether or not the user speaks after the user's inspiration based on the inspiration information indicating the user's inspiration. In this way, the information processing device can appropriately predict whether or not the user has spoken by predicting whether or not there is a subsequent utterance of the user based on the state of the user's inspiration.
- the acquisition unit acquires inspiration information including the amount of increase in the inspiration of the user.
- the prediction unit predicts whether or not the user speaks after inspiration based on the amount of increase. In this way, the information processing device can accurately predict the presence or absence of the user's utterance by using the increased amount of the user's intake air.
- the acquisition unit acquires intake information including the intake amount of the user's intake.
- the prediction unit predicts whether or not the user speaks after the inspiration based on the inspiratory amount. In this way, the information processing device can accurately predict the presence or absence of the user's utterance by using the intake amount of the user's intake air.
- the acquisition unit acquires inspiration information including the initial inspiration amount at the start of inspiration of the user.
- the prediction unit predicts whether or not the user speaks after the inspiration based on the initial inspiration amount. In this way, the information processing device can accurately predict the presence or absence of the user's utterance by using the initial intake amount of the user's intake air.
- the prediction unit predicts whether or not the user speaks after inspiration using the score calculated based on the inspiration information. In this way, the information processing device can appropriately predict whether or not the user is speaking by using the score calculated based on the intake information.
- the prediction unit predicts that the user speaks after inspiration when the comparison result between the score and the threshold value satisfies a predetermined condition.
- the information processing apparatus can appropriately predict the presence or absence of the user's utterance by predicting the user's utterance based on the comparison between the score and the threshold value.
- the information processing device includes an execution unit (execution unit 134 in the embodiment).
- the execution unit executes processing according to the prediction result by the prediction unit. In this way, the information processing device can execute an appropriate process according to whether or not the user speaks by executing the process according to the prediction result of the presence or absence of the user's utterance.
- the execution unit executes preprocessing related to voice recognition before the user's inspiration ends.
- By executing preprocessing related to voice recognition before the end of inspiration, the information processing device can prepare for voice recognition prior to the user's utterance and can improve usability.
- The execution unit executes preprocessing related to voice recognition when the prediction unit predicts that the user will speak after inspiration. In this way, when the user is predicted to speak, the information processing device can prepare for voice recognition according to the prediction by executing preprocessing related to voice recognition, and can improve usability.
- The execution unit executes the preprocessing before the user's inspiration is completed.
- By executing preprocessing related to voice recognition before the end of inspiration, the information processing device can prepare for voice recognition prior to the user's utterance and can improve usability.
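The flow from prediction to preprocessing can be sketched as follows. The "predetermined condition" is assumed here to be "score greater than threshold", and the callback standing in for voice recognition preprocessing is hypothetical.

```python
from typing import Callable


def predict_utterance(score: float, threshold: float) -> bool:
    """Predict that the user will speak when the score exceeds the threshold (assumed condition)."""
    return score > threshold


def on_inspiration_sample(score: float, threshold: float,
                          start_preprocessing: Callable[[], None]) -> bool:
    """Run on inspiration data received while the user is still inhaling.

    When an utterance is predicted, trigger the preprocessing callback right
    away so that voice recognition is ready before the utterance begins.
    """
    if predict_utterance(score, threshold):
        start_preprocessing()
        return True
    return False
```

The callback could, for example, open the microphone stream or start up the recognition engine; what counts as preprocessing is left to the caller in this sketch.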
- the prediction unit classifies the user's intake air based on the intake air information.
- the information processing apparatus can execute the subsequent processing by classifying the user's intake air and using the result of classifying the user's intake air state.
- the acquisition unit acquires intake information including the maximum intake amount of the user's intake.
- the predictor classifies the user's inspiration based on the maximum inspiratory volume. In this way, the information processing device can accurately classify the user's intake air by using the maximum intake amount of the user's intake air.
- the acquisition unit acquires time point information indicating the start time of utterance after the user's inspiration.
- the predictor classifies the user's inspiration based on the interval between the time of maximum inspiration and the time of utterance start. In this way, the information processing device can accurately classify the user's inspiration by using the information of the interval between the time of the maximum intake amount and the time of the start of utterance.
- the acquisition unit acquires utterance information including the length and number of characters of the utterance after the user's inhalation.
- the predictor classifies the user's inspiration based on the length of the utterance and the number of characters. In this way, the information processing device can accurately classify the user's inspiration by using the length of the utterance after the user's inspiration and the number of characters.
- the prediction unit classifies the user's intake air into at least one of a plurality of types including the requested type intake and the unrequested type intake.
- the information processing apparatus can appropriately classify the user's intake state by classifying the user's intake into one of a plurality of types including the requested type intake and the unrequested type intake.
- the prediction unit classifies the user's inspiration into at least one of a plurality of types including long-sentence inspiration and short-sentence inspiration.
- the information processing apparatus can appropriately classify the user's intake state by classifying the user's intake into one of a plurality of types including the long sentence type intake and the short sentence type intake.
- the prediction unit classifies the user's intake air into at least one of a plurality of types including a normal processing desired type intake and a shortened processing desired type intake.
- the information processing apparatus can appropriately classify the user's intake state by classifying the user's intake air into one of a plurality of types including the normal processing desired type intake and the shortened processing desired type intake.
- the information processing apparatus includes a selection unit (selection unit 133 in the embodiment).
- the selection unit performs selection processing according to the classification result by the prediction unit. In this way, the information processing device can make an appropriate selection according to whether or not the user speaks by executing the selection process according to the prediction result of the presence or absence of the user's utterance.
- the selection unit selects the process to be executed according to the classification result by the prediction unit.
- the information processing apparatus can appropriately select the process to be executed according to whether or not the user speaks by selecting the process to be executed according to the classification result by the prediction unit.
- the selection unit selects information to be used for processing the user's utterance according to the classification result by the prediction unit.
- By selecting the information used for processing the user's utterance according to the classification result by the prediction unit, the information processing device can appropriately select the information to be used according to whether or not the user speaks.
- FIG. 26 is a hardware configuration diagram showing an example of a computer 1000 that realizes the functions of the information processing device.
- the computer 1000 includes a CPU 1100, a RAM 1200, a ROM (Read Only Memory) 1300, an HDD (Hard Disk Drive) 1400, a communication interface 1500, and an input / output interface 1600.
- Each part of the computer 1000 is connected by a bus 1050.
- the CPU 1100 operates based on the program stored in the ROM 1300 or the HDD 1400, and controls each part. For example, the CPU 1100 expands the program stored in the ROM 1300 or the HDD 1400 into the RAM 1200 and executes processing corresponding to various programs.
- the ROM 1300 stores a boot program such as a BIOS (Basic Input Output System) executed by the CPU 1100 when the computer 1000 is started, a program that depends on the hardware of the computer 1000, and the like.
- the HDD 1400 is a computer-readable recording medium that non-temporarily records a program executed by the CPU 1100 and data used by the program.
- the HDD 1400 is a recording medium for recording an information processing program according to the present disclosure, which is an example of program data 1450.
- the communication interface 1500 is an interface for the computer 1000 to connect to an external network 1550 (for example, the Internet).
- the CPU 1100 receives data from another device or transmits data generated by the CPU 1100 to another device via the communication interface 1500.
- the input / output interface 1600 is an interface for connecting the input / output device 1650 and the computer 1000.
- the CPU 1100 receives data from an input device such as a keyboard or mouse via the input / output interface 1600. Further, the CPU 1100 transmits data to an output device such as a display, a speaker, or a printer via the input / output interface 1600. Further, the input / output interface 1600 may function as a media interface for reading a program or the like recorded on a predetermined recording medium (media).
- the media is, for example, an optical recording medium such as a DVD (Digital Versatile Disc) or PD (Phase change rewritable Disk), a magneto-optical recording medium such as an MO (Magneto-Optical disk), a tape medium, a magnetic recording medium, or a semiconductor memory.
- The CPU 1100 of the computer 1000 implements the functions of the control unit 130 and the like by executing the information processing program loaded into the RAM 1200.
- The HDD 1400 stores the information processing program according to the present disclosure and the data of the storage unit 120.
- The CPU 1100 reads the program data 1450 from the HDD 1400 and executes it; as another example, these programs may be acquired from another device via the external network 1550.
- The present technology can also have the following configurations.
- (1) An information processing device comprising: an acquisition unit that acquires inspiration information indicating inspiration of a user; and a prediction unit that predicts, based on the inspiration information acquired by the acquisition unit, whether or not the user will speak after the inspiration of the user.
- (2) The information processing device according to (1), wherein the acquisition unit acquires the inspiration information including an amount of increase in the inspiration of the user, and the prediction unit predicts, based on the amount of increase, whether or not the user will speak after the inspiration.
- (3) The information processing device according to (1) or (2), wherein the acquisition unit acquires the inspiration information including an inspiration amount of the inspiration of the user, and the prediction unit predicts, based on the inspiration amount, whether or not the user will speak after the inspiration.
- (4) The information processing device according to any one of (1) to (3), wherein the acquisition unit acquires the inspiration information including an initial inspiration amount at the start of the inspiration of the user, and the prediction unit predicts, based on the initial inspiration amount, whether or not the user will speak after the inspiration.
- (5) The information processing device according to any one of (1) to (4), wherein the prediction unit predicts whether or not the user will speak after the inspiration by using a score calculated based on the inspiration information.
- (6) The information processing device according to (5), wherein the prediction unit predicts that the user will speak after the inspiration when the result of comparing the score with a threshold satisfies a predetermined condition.
- (7) The information processing device according to any one of (1) to (6), further comprising an execution unit that executes processing according to the prediction result of the prediction unit.
- (8) The information processing device according to (7), wherein the execution unit executes pre-processing related to speech recognition when the prediction unit predicts that the user will speak after the inspiration.
- (9) The information processing device according to (8), wherein the execution unit executes the pre-processing before the inspiration of the user ends.
- (10) The information processing device according to any one of (1) to (9), wherein the prediction unit classifies the inspiration of the user based on the inspiration information.
- (11) The information processing device according to (10), wherein the acquisition unit acquires the inspiration information including a maximum inspiration amount of the inspiration of the user, and the prediction unit classifies the inspiration of the user based on the maximum inspiration amount.
- (12) The information processing device according to (11), wherein the acquisition unit acquires time point information indicating the utterance start time point after the inspiration of the user, and the prediction unit classifies the inspiration of the user based on the interval between the time point of the maximum inspiration amount and the utterance start time point.
- (13) The information processing device according to any one of (10) to (12), wherein the acquisition unit acquires utterance information including the length and the number of characters of the utterance after the inspiration of the user, and the prediction unit classifies the inspiration of the user based on the length and the number of characters of the utterance.
- (14) The information processing device according to any one of (10) to (13), wherein the prediction unit classifies the inspiration of the user into one of a plurality of types including at least a request-type inspiration and a non-request-type inspiration.
- (15) The information processing device according to any one of (10) to (13), wherein the prediction unit classifies the inspiration of the user into one of a plurality of types including at least a long-sentence-type inspiration and a short-sentence-type inspiration.
- (16) The information processing device according to any one of (10) to (13), wherein the prediction unit classifies the inspiration of the user into one of a plurality of types including at least a normal-processing-desired-type inspiration and a shortened-processing-desired-type inspiration.
- (17) The information processing device according to any one of (10) to (16), further comprising a selection unit that performs selection processing according to the classification result of the prediction unit.
- (18) The information processing device according to (17), wherein the selection unit selects the processing to be executed according to the classification result of the prediction unit.
- (19) The information processing device according to (17) or (18), wherein the selection unit selects the information to be used for processing the utterance of the user according to the classification result of the prediction unit.
- (20) An information processing method that executes processing of: acquiring inspiration information indicating inspiration of a user; and predicting, based on the acquired inspiration information, whether or not the user will speak after the inspiration of the user.
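The score-and-threshold prediction of configurations (2) to (6) can be illustrated with a minimal sketch. The weighted-sum scoring rule, the weights, the threshold value, the "score exceeds threshold" condition, and all names below are assumptions chosen for illustration; the configurations only state that a score is calculated from the inspiration information and compared with a threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InspirationInfo:
    """Features named in configurations (2) to (4); units depend on the sensor."""
    increase_amount: float     # amount of increase in the inspiration
    inspiration_amount: float  # inspiration amount observed so far
    initial_amount: float      # inspiration amount at the start of the inspiration


# Hypothetical weights and threshold; concrete values are not given in the text.
WEIGHTS = (0.5, 0.3, 0.2)
THRESHOLD = 0.6


def utterance_score(info: InspirationInfo, weights=WEIGHTS) -> float:
    """Weighted sum of the three features (an assumed scoring rule)."""
    w_inc, w_amt, w_init = weights
    return (w_inc * info.increase_amount
            + w_amt * info.inspiration_amount
            + w_init * info.initial_amount)


def predicts_utterance(info: InspirationInfo, threshold: float = THRESHOLD) -> bool:
    """Predict an utterance when the score exceeds the threshold (assumed condition)."""
    return utterance_score(info) > threshold


deep = InspirationInfo(increase_amount=0.9, inspiration_amount=0.8, initial_amount=0.2)
shallow = InspirationInfo(increase_amount=0.1, inspiration_amount=0.2, initial_amount=0.2)
print(predicts_utterance(deep), predicts_utterance(shallow))
```

With these illustrative values, the deep breath scores about 0.73 and is predicted to precede speech, while the shallow breath scores about 0.15 and is not.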
- Reference signs: 1 information processing system; 100 server device (information processing device); 110 communication unit; 120 storage unit; 121 inspiration information storage unit; 122 user information storage unit; 123 threshold information storage unit; 124 function information storage unit; 130 control unit; 131 acquisition unit; 132 prediction unit; 133 selection unit; 134 execution unit; 135 transmission unit; 10 terminal device; 11 communication unit; 12 input unit; 13 output unit; 14 storage unit; 15 control unit; 151 receiving unit; 152 execution unit; 153 acceptance unit; 154 transmission unit; 16 display unit; 17 sensor unit; 171 respiration sensor; 18 light source unit
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- User Interface Of Digital Computer (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
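1. Embodiment
1-1. Overview of information processing according to the embodiment of the present disclosure
1-1-1. Background, effects, and the like
1-1-2. Sensor examples
1-1-2-1. Contact type
1-1-2-2. Non-contact type
1-2. Configuration of the information processing system according to the embodiment
1-3. Configuration of the information processing device according to the embodiment
1-4. Configuration of the terminal device according to the embodiment
1-5. Information processing procedure according to the embodiment
1-5-1. Processing procedure of the information processing device
1-5-2. Processing procedure of the information processing system
1-6. Processing examples using classification results
1-6-1. Example of omitting the wake word depending on the breathing state
1-6-2. Example of switching between local and cloud speech recognition
1-6-3. Example of changing the speech recognition dictionary
1-6-4. Example of changing the UI selected depending on the inspiration state
1-6-5. Example of changing the system response
2. Other embodiments
2-1. Configuration example in which prediction processing and the like are performed on the client side
2-2. Other configuration examples
2-3. Others
3. Effects according to the present disclosure
4. Hardware configuration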
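[1-1. Overview of information processing according to the embodiment of the present disclosure]
FIG. 1 is a diagram illustrating an example of information processing according to the embodiment of the present disclosure. The information processing according to the embodiment is implemented by an information processing system 1 (see FIG. 5) that includes a server device 100 (see FIG. 6) and a terminal device 10 (see FIG. 8).
Existing voice UIs (user interfaces) do not take the state of the speaking user into account and can therefore impose various burdens on the user. For example, the user has to input a wake word (by voice) or operate an activation button.
The example of FIG. 1 uses a millimeter-wave radar as an example of the respiration sensor 171 that detects inspiration information indicating the user's inspiration. However, the respiration sensor 171 is not limited to a millimeter-wave radar and may be any sensor capable of detecting the user's inspiration information. Examples are given below.
The example of FIG. 1 describes detection of inspiration information with the respiration sensor 171 using a millimeter-wave radar, that is, a non-contact sensor. However, the sensor used to detect (acquire) inspiration information is not limited to the non-contact type and may be a contact type. Examples of contact-type sensors are given below.
The non-contact sensor is likewise not limited to a millimeter-wave radar, and various non-contact sensors may be used as the respiration sensor 171. Examples of non-contact sensors other than a millimeter-wave radar are given below.
・Unobtrusive respiration sensing method <https://shingi.jst.go.jp/past_abst/abst/p/09/919/tama2.pdf>
・Capacitive film-type proximity sensor that watches over human movement and respiration <https://www.aist.go.jp/aist_j/press_release/pr2016/pr20160125/pr20160125.html>
・Heartbeat and respiration detection sensor "GZS-350 series" <https://www.ipros.jp/product/detail/2000348329/>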
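The information processing system 1 shown in FIG. 5 will be described. As shown in FIG. 5, the information processing system 1 includes the terminal device 10 and the server device 100. The terminal device 10 and the server device 100 are communicably connected, by wire or wirelessly, via a predetermined communication network (network N). FIG. 5 is a diagram illustrating a configuration example of the information processing system according to the embodiment. The information processing system 1 shown in FIG. 5 may include a plurality of terminal devices 10 and a plurality of server devices 100.
Next, the configuration of the server device 100, which is an example of an information processing device that executes the information processing according to the embodiment, will be described. FIG. 6 is a diagram illustrating a configuration example of the server device 100 according to the embodiment of the present disclosure.
Next, the configuration of the terminal device 10, which is an example of an information processing device that executes the information processing according to the embodiment, will be described. FIG. 8 is a diagram illustrating a configuration example of the terminal device according to the embodiment of the present disclosure.
Next, the procedures of the various kinds of information processing according to the embodiment will be described with reference to FIGS. 9 and 10.
First, the flow of processing of the information processing device according to the embodiment of the present disclosure will be described with reference to FIG. 9. FIG. 9 is a flowchart illustrating the processing procedure of the information processing device according to the embodiment of the present disclosure. Specifically, FIG. 9 is a flowchart illustrating the procedure of information processing by the server device 100.
Next, the flow of processing of the information processing system according to the embodiment of the present disclosure will be described with reference to FIG. 10. FIG. 10 is a sequence diagram illustrating the processing procedure of the information processing system according to the embodiment of the present disclosure.
The server device 100 is not limited to the prediction processing described above and may perform various other kinds of processing. For example, the server device 100 may perform classification processing that classifies the user's inspiration. This point is described below. In the following description, explanations of points that are the same as in FIG. 1 are omitted as appropriate.
An example of classification processing will be described with reference to FIG. 11. FIG. 11 is a diagram illustrating an example of processing that uses the classification result of inspiration. FIG. 11 shows an example of omitting the wake word depending on the breathing state.
An example of classification processing will be described with reference to FIG. 14. FIG. 14 is a diagram illustrating an example of processing that uses the classification result of inspiration. FIG. 14 shows an example of switching between local and cloud speech recognition.
An example of classification processing will be described with reference to FIG. 17. FIG. 17 is a diagram illustrating an example of processing that uses the classification result of inspiration. FIG. 17 shows an example of changing the speech recognition dictionary.
The information processing system 1 is not limited to the examples described above and may select various kinds of information and processing based on the classification into long-sentence-type and short-sentence-type inspiration. This point will be described with reference to FIG. 20. FIG. 20 is a diagram illustrating an example of processing that uses the classification result of inspiration.
An example of classification processing will be described with reference to FIGS. 21 and 22. FIGS. 21 and 22 are diagrams illustrating examples of processing that uses the classification result of inspiration. First, the example of FIG. 21 will be described. FIG. 21 shows an example of changing the system response (Text-To-Speech). Specifically, FIG. 21 shows an example of how the response changes at the time of a Wuw (Wake up Word).
The processing according to each embodiment described above may be carried out in various forms (modifications) other than the embodiments and modifications described above.
In the embodiment, the case where the server device 100 performs the prediction processing, the classification processing, and the like has been shown as an example of the system configuration, but the terminal device 10 may perform the prediction processing and the classification processing. That is, the terminal device 10, which is the client-side device, may be the information processing device that performs the prediction processing and the classification processing described above. In this way, the system configuration of the information processing system 1 is not limited to one in which the server device 100, the server-side device, performs the prediction processing and the classification processing; it may be one in which the terminal device 10, the client-side device, performs them.
In the above example, the server device 100 and the terminal device 10 are separate devices, but they may be integrated into a single device.
Of the processes described in the above embodiments, all or part of the processes described as being performed automatically can be performed manually, and all or part of the processes described as being performed manually can be performed automatically by a known method. In addition, the processing procedures, specific names, and information including various data and parameters shown in the above text and in the drawings can be changed arbitrarily unless otherwise specified. For example, the various kinds of information shown in each figure are not limited to the illustrated information.
As described above, the information processing device according to the present disclosure (the server device 100 in the embodiment) includes an acquisition unit (the acquisition unit 131 in the embodiment) and a prediction unit (the prediction unit 132 in the embodiment). The acquisition unit acquires inspiration information indicating the user's inspiration. The prediction unit predicts, based on the inspiration information acquired by the acquisition unit, whether or not the user will speak after the user's inspiration.
Information equipment such as the server device 100 and the terminal device 10 according to the embodiments described above is implemented by, for example, a computer 1000 configured as shown in FIG. 26. FIG. 26 is a hardware configuration diagram illustrating an example of the computer 1000 that implements the functions of the information processing device. The server device 100 according to the embodiment is used as an example below. The computer 1000 includes a CPU 1100, a RAM 1200, a ROM (Read Only Memory) 1300, an HDD (Hard Disk Drive) 1400, a communication interface 1500, and an input/output interface 1600. The units of the computer 1000 are connected by a bus 1050.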
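(1)
An information processing device comprising:
an acquisition unit that acquires inspiration information indicating inspiration of a user; and
a prediction unit that predicts, based on the inspiration information acquired by the acquisition unit, whether or not the user will speak after the inspiration of the user.
(2)
The information processing device according to (1), wherein
the acquisition unit acquires the inspiration information including an amount of increase in the inspiration of the user, and
the prediction unit predicts, based on the amount of increase, whether or not the user will speak after the inspiration.
(3)
The information processing device according to (1) or (2), wherein
the acquisition unit acquires the inspiration information including an inspiration amount of the inspiration of the user, and
the prediction unit predicts, based on the inspiration amount, whether or not the user will speak after the inspiration.
(4)
The information processing device according to any one of (1) to (3), wherein
the acquisition unit acquires the inspiration information including an initial inspiration amount at the start of the inspiration of the user, and
the prediction unit predicts, based on the initial inspiration amount, whether or not the user will speak after the inspiration.
(5)
The information processing device according to any one of (1) to (4), wherein
the prediction unit predicts whether or not the user will speak after the inspiration by using a score calculated based on the inspiration information.
(6)
The information processing device according to (5), wherein
the prediction unit predicts that the user will speak after the inspiration when the result of comparing the score with a threshold satisfies a predetermined condition.
(7)
The information processing device according to any one of (1) to (6), further comprising
an execution unit that executes processing according to the prediction result of the prediction unit.
(8)
The information processing device according to (7), wherein
the execution unit executes pre-processing related to speech recognition when the prediction unit predicts that the user will speak after the inspiration.
(9)
The information processing device according to (8), wherein
the execution unit executes the pre-processing before the inspiration of the user ends.
(10)
The information processing device according to any one of (1) to (9), wherein
the prediction unit classifies the inspiration of the user based on the inspiration information.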
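(11)
The information processing device according to (10), wherein
the acquisition unit acquires the inspiration information including a maximum inspiration amount of the inspiration of the user, and
the prediction unit classifies the inspiration of the user based on the maximum inspiration amount.
(12)
The information processing device according to (11), wherein
the acquisition unit acquires time point information indicating the utterance start time point after the inspiration of the user, and
the prediction unit classifies the inspiration of the user based on the interval between the time point of the maximum inspiration amount and the utterance start time point.
(13)
The information processing device according to any one of (10) to (12), wherein
the acquisition unit acquires utterance information including the length and the number of characters of the utterance after the inspiration of the user, and
the prediction unit classifies the inspiration of the user based on the length and the number of characters of the utterance.
(14)
The information processing device according to any one of (10) to (13), wherein
the prediction unit classifies the inspiration of the user into one of a plurality of types including at least a request-type inspiration and a non-request-type inspiration.
(15)
The information processing device according to any one of (10) to (13), wherein
the prediction unit classifies the inspiration of the user into one of a plurality of types including at least a long-sentence-type inspiration and a short-sentence-type inspiration.
(16)
The information processing device according to any one of (10) to (13), wherein
the prediction unit classifies the inspiration of the user into one of a plurality of types including at least a normal-processing-desired-type inspiration and a shortened-processing-desired-type inspiration.
(17)
The information processing device according to any one of (10) to (16), further comprising
a selection unit that performs selection processing according to the classification result of the prediction unit.
(18)
The information processing device according to (17), wherein
the selection unit selects the processing to be executed according to the classification result of the prediction unit.
(19)
The information processing device according to (17) or (18), wherein
the selection unit selects the information to be used for processing the utterance of the user according to the classification result of the prediction unit.
(20)
An information processing method that executes processing of:
acquiring inspiration information indicating inspiration of a user; and
predicting, based on the acquired inspiration information, whether or not the user will speak after the inspiration of the user.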
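The classification and selection described in configurations (10) to (19) might be sketched as follows. This is only an illustration under stated assumptions: the cut-off values, the rule combining the maximum inspiration amount with the interval to the utterance start, and the mapping from inspiration type to a local or cloud recognizer are all hypothetical. The text names the quantities and the kinds of selection, not the concrete rules.

```python
from enum import Enum


class InspirationType(Enum):
    LONG_SENTENCE = "long-sentence type"
    SHORT_SENTENCE = "short-sentence type"


# Hypothetical cut-offs; the text states only which quantities are used.
MAX_AMOUNT_CUTOFF = 0.7   # normalized maximum inspiration amount
INTERVAL_CUTOFF_S = 0.5   # seconds between peak inspiration and utterance start


def classify_inspiration(max_amount: float, peak_time_s: float,
                         utterance_start_s: float) -> InspirationType:
    """Classify from the maximum amount and the peak-to-utterance interval."""
    interval = utterance_start_s - peak_time_s
    if max_amount >= MAX_AMOUNT_CUTOFF and interval >= INTERVAL_CUTOFF_S:
        return InspirationType.LONG_SENTENCE
    return InspirationType.SHORT_SENTENCE


# Assumed mapping from the classification result to the recognizer to use.
RECOGNIZER_BY_TYPE = {
    InspirationType.LONG_SENTENCE: "cloud",
    InspirationType.SHORT_SENTENCE: "local",
}


def select_recognizer(kind: InspirationType) -> str:
    """Selection unit: pick the processing according to the classification."""
    return RECOGNIZER_BY_TYPE[kind]


kind = classify_inspiration(max_amount=0.9, peak_time_s=1.0, utterance_start_s=1.8)
print(kind.value, select_recognizer(kind))
```

Keeping the mapping in a table separates the classification rule from the selection policy, so the same classification result could also drive a dictionary or response choice.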
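100 Server device (information processing device)
110 Communication unit
120 Storage unit
121 Inspiration information storage unit
122 User information storage unit
123 Threshold information storage unit
124 Function information storage unit
130 Control unit
131 Acquisition unit
132 Prediction unit
133 Selection unit
134 Execution unit
135 Transmission unit
10 Terminal device
11 Communication unit
12 Input unit
13 Output unit
14 Storage unit
15 Control unit
151 Receiving unit
152 Execution unit
153 Acceptance unit
154 Transmission unit
16 Display unit
17 Sensor unit
171 Respiration sensor
18 Light source unit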
Claims (20)
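- An information processing device comprising: an acquisition unit that acquires inspiration information indicating inspiration of a user; and a prediction unit that predicts, based on the inspiration information acquired by the acquisition unit, whether or not the user will speak after the inspiration of the user.
- The information processing device according to claim 1, wherein the acquisition unit acquires the inspiration information including an amount of increase in the inspiration of the user, and the prediction unit predicts, based on the amount of increase, whether or not the user will speak after the inspiration.
- The information processing device according to claim 1, wherein the acquisition unit acquires the inspiration information including an inspiration amount of the inspiration of the user, and the prediction unit predicts, based on the inspiration amount, whether or not the user will speak after the inspiration.
- The information processing device according to claim 1, wherein the acquisition unit acquires the inspiration information including an initial inspiration amount at the start of the inspiration of the user, and the prediction unit predicts, based on the initial inspiration amount, whether or not the user will speak after the inspiration.
- The information processing device according to claim 1, wherein the prediction unit predicts whether or not the user will speak after the inspiration by using a score calculated based on the inspiration information.
- The information processing device according to claim 5, wherein the prediction unit predicts that the user will speak after the inspiration when the result of comparing the score with a threshold satisfies a predetermined condition.
- The information processing device according to claim 1, further comprising an execution unit that executes processing according to the prediction result of the prediction unit.
- The information processing device according to claim 7, wherein the execution unit executes pre-processing related to speech recognition when the prediction unit predicts that the user will speak after the inspiration.
- The information processing device according to claim 8, wherein the execution unit executes the pre-processing before the inspiration of the user ends.
- The information processing device according to claim 1, wherein the prediction unit classifies the inspiration of the user based on the inspiration information.
- The information processing device according to claim 10, wherein the acquisition unit acquires the inspiration information including a maximum inspiration amount of the inspiration of the user, and the prediction unit classifies the inspiration of the user based on the maximum inspiration amount.
- The information processing device according to claim 11, wherein the acquisition unit acquires time point information indicating the utterance start time point after the inspiration of the user, and the prediction unit classifies the inspiration of the user based on the interval between the time point of the maximum inspiration amount and the utterance start time point.
- The information processing device according to claim 10, wherein the acquisition unit acquires utterance information including the length and the number of characters of the utterance after the inspiration of the user, and the prediction unit classifies the inspiration of the user based on the length and the number of characters of the utterance.
- The information processing device according to claim 10, wherein the prediction unit classifies the inspiration of the user into one of a plurality of types including at least a request-type inspiration and a non-request-type inspiration.
- The information processing device according to claim 10, wherein the prediction unit classifies the inspiration of the user into one of a plurality of types including at least a long-sentence-type inspiration and a short-sentence-type inspiration.
- The information processing device according to claim 10, wherein the prediction unit classifies the inspiration of the user into one of a plurality of types including at least a normal-processing-desired-type inspiration and a shortened-processing-desired-type inspiration.
- The information processing device according to claim 10, further comprising a selection unit that performs selection processing according to the classification result of the prediction unit.
- The information processing device according to claim 17, wherein the selection unit selects the processing to be executed according to the classification result of the prediction unit.
- The information processing device according to claim 17, wherein the selection unit selects the information to be used for processing the utterance of the user according to the classification result of the prediction unit.
- An information processing method that executes processing of: acquiring inspiration information indicating inspiration of a user; and predicting, based on the acquired inspiration information, whether or not the user will speak after the inspiration of the user.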
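Claims 7 to 9 describe an execution unit that starts pre-processing for speech recognition once an utterance is predicted, and does so before the inspiration ends. A minimal sketch of that timing is shown below, assuming the inspiration amount arrives as a stream of samples and that a simple per-sample threshold stands in for the prediction. The function name, the callback, and the sample values are hypothetical.

```python
from typing import Callable, Iterable


def run_execution_unit(samples: Iterable[float], threshold: float,
                       on_predicted: Callable[[int], None]) -> bool:
    """Consume inspiration samples in order and trigger pre-processing once.

    The callback fires at the first sample whose amount exceeds the threshold,
    so later samples of the same inspiration have not been consumed yet.
    Returns True if pre-processing was triggered.
    """
    for index, amount in enumerate(samples):
        if amount > threshold:
            on_predicted(index)  # e.g. warm up a recognizer (assumed action)
            return True
    return False


started_at = []
samples = [0.1, 0.3, 0.65, 0.9, 1.0]  # the inspiration is still rising after index 2
fired = run_execution_unit(samples, threshold=0.6, on_predicted=started_at.append)
print(fired, started_at)
```

In this run the callback fires at sample index 2, while two samples of the same inspiration are still to come, which is the ordering that claim 9 describes.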
Priority Applications (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| JP2021574591A JP7597040B2 (ja) | 2020-01-31 | 2021-01-12 | Information processing device and information processing method |
| US17/794,633 US12198694B2 (en) | 2020-01-31 | 2021-01-12 | Information processing apparatus and information processing method |
| EP21748287.6A EP4099321A4 (en) | 2020-01-31 | 2021-01-12 | INFORMATION PROCESSING DEVICE AND METHOD |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| JP2020-014529 | 2020-01-31 | ||
| JP2020014529 | 2020-01-31 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2021153201A1 (ja) | 2021-08-05 |
Family
ID=77078788
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/JP2021/000600 Ceased WO2021153201A1 (ja) | 2020-01-31 | 2021-01-12 | Information processing device and information processing method |
Country Status (4)
| Country | Link |
|---|---|
| US (1) | US12198694B2 (ja) |
| EP (1) | EP4099321A4 (ja) |
| JP (1) | JP7597040B2 (ja) |
| WO (1) | WO2021153201A1 (ja) |
Families Citing this family (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| EP4099318A4 (en) * | 2020-01-31 | 2023-05-10 | Sony Group Corporation | Information processing device and information processing method |
| JP7670105B1 (ja) * | 2023-11-15 | 2025-04-30 | FUJIFILM Business Innovation Corp. | Information processing system and program |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JPH11249773A (ja) * | 1998-02-27 | 1999-09-17 | Toshiba Corp | Multimodal interface device and multimodal interface method |
| JP2012032557A (ja) * | 2010-07-30 | 2012-02-16 | Internatl Business Mach Corp <Ibm> | Device, method, and program for detecting inspiration sounds contained in speech |
| JP2016042345A (ja) * | 2014-08-13 | 2016-03-31 | Nippon Telegraph and Telephone Corporation | Estimation device, method therefor, and program |
| JP2017211596A (ja) | 2016-05-27 | 2017-11-30 | Toyota Motor Corporation | Voice dialogue system and utterance timing determination method |
Family Cites Families (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
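| JP6335157B2 (ja) | 2015-12-24 | 2018-05-30 | Nippon Telegraph and Telephone Corporation | Conversation support system, conversation support device, and conversation support program |
| JP6775387B2 (ja) | 2016-11-11 | 2020-10-28 | Nippon Telegraph and Telephone Corporation | Estimation method and estimation system |
| JP6923827B2 (ja) * | 2017-11-10 | 2021-08-25 | Nippon Telegraph and Telephone Corporation | Communication skill evaluation system, device, method, and program |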
- 2021
- 2021-01-12 JP JP2021574591A patent/JP7597040B2/ja active Active
- 2021-01-12 EP EP21748287.6A patent/EP4099321A4/en not_active Withdrawn
- 2021-01-12 WO PCT/JP2021/000600 patent/WO2021153201A1/ja not_active Ceased
- 2021-01-12 US US17/794,633 patent/US12198694B2/en active Active
Non-Patent Citations (1)
| Title |
|---|
| See also references of EP4099321A4 |
Also Published As
| Publication number | Publication date |
|---|---|
| US20230064042A1 (en) | 2023-03-02 |
| US12198694B2 (en) | 2025-01-14 |
| JPWO2021153201A1 (ja) | 2021-08-05 |
| EP4099321A4 (en) | 2023-05-24 |
| EP4099321A1 (en) | 2022-12-07 |
| JP7597040B2 (ja) | 2024-12-10 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US11810557B2 (en) | | Dynamic and/or context-specific hot words to invoke automated assistant |
| EP3895161B1 (en) | | Utilizing pre-event and post-event input streams to engage an automated assistant |
| EP3759709B1 (en) | | Selectively activating on-device speech recognition, and using recognized text in selectively activating on-device nlu and/or on-device fulfillment |
| US12051416B2 (en) | | Methods and systems for reducing latency in automated assistant interactions |
| JP7173049B2 (ja) | | Information processing device, information processing system, information processing method, and program |
| US10643637B2 (en) | | Retroactive sound identification system |
| JP7597040B2 (ja) | | Information processing device and information processing method |
| US20250285621A1 (en) | | Selectively activating on-device speech recognition, and using recognized text in selectively activating on-device nlu and/or on-device fulfillment |
| WO2019239659A1 (ja) | | Information processing device and information processing method |
| JP7375741B2 (ja) | | Information processing device, information processing method, and program |
| US11948580B2 (en) | | Collaborative ranking of interpretations of spoken utterances |
| JP2026012740A (ja) | | Selective generation and/or selective rendering of continued content for completing a spoken utterance |
| WO2021153427A1 (ja) | | Information processing device and information processing method |
| US20240331681A1 (en) | | Automatic adaptation of the synthesized speech output of a translation application |
| JP7661896B2 (ja) | | Information processing device and information processing method |
| KR20190018666A (ko) | | Method and system for automatic activation of a machine |
| CN117136405A (zh) | | Generating automated assistant responses using large language models |
| US12254886B2 (en) | | Collaborative ranking of interpretations of spoken utterances |
| CN117121098A (zh) | | Ephemeral learning of machine learning models |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 21748287; Country of ref document: EP; Kind code of ref document: A1 |
| | ENP | Entry into the national phase | Ref document number: 2021574591; Country of ref document: JP; Kind code of ref document: A |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | ENP | Entry into the national phase | Ref document number: 2021748287; Country of ref document: EP; Effective date: 20220831 |
| | WWW | Wipo information: withdrawn in national office | Ref document number: 2021748287; Country of ref document: EP |