JPH11288342A

JPH11288342A - Interface device and method for multimodal input / output device

Info

Publication number: JPH11288342A
Application number: JP16344998A
Authority: JP
Inventors: Katsumi Tanaka (田中克己); Tetsuro Chino (知野哲朗)
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 1998-02-09
Filing date: 1998-06-11
Publication date: 1999-10-19
Anticipated expiration: 2018-06-11
Also published as: JP3822357B2

Abstract

(57) [Abstract]
[Problem] To provide an interface device for a multimodal input/output device that allows the user to give input freely and that can adapt to the surrounding environment.
[Configuration] The device has a gaze detection engine 101 based on image input, a speech recognition engine 102 based on voice input, an operation input unit 103 comprising a mouse, a keyboard and the like, an input integration unit 104 that integrates the inputs from 101 to 103 and detects the user's intention, and a feedback generation unit 105 that produces output to the user based on the intention detection result. The user's intention behind a selection instruction for an icon is detected from the information supplied by 101 to 103, and a required presentation is returned to the user to indicate that the selection instruction for the corresponding icon has been recognized. To this end, the intention is detected using causal relationship information between pieces of input information, and the causal relationship information is learned statistically through actual input and output with the user.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a multimodal input/output interface that estimates a user's intention through the input or output of at least one of visual information, voice information, and operation information, and returns feedback to the user based on that estimate.

[0002]

[Prior Art] In recent years, computer systems including personal computers have become able to input and output multimedia information such as voice and image information, in addition to conventional input by keyboard and mouse and output of text and images on a display. Against this background, and with advances in natural language analysis, natural language generation, speech recognition, speech synthesis, and dialogue processing, demand has grown for spoken dialogue systems that interact with the user through voice input and output. Various spoken dialogue systems have been developed, such as "TOSBURG-II" (Trans. IEICE, Vol. J77-D-II, No. 8, pp. 1417-1428, 1994), a dialogue system that can be used through spontaneous speech input.

[0003] Furthermore, beyond such voice input and output, there is growing demand for multimodal dialogue systems that interact with the user by using, for example, visual information input from a camera, or information that can be exchanged with the user through various input/output devices such as touch panels, pens, tablets, data gloves, foot switches, human presence sensors, head-mounted displays, and force displays (force feedback devices).

[0004] Even in dialogue between people, communication does not rely on a single medium (channel) such as speech alone. People interact naturally and smoothly by making full use of nonverbal messages exchanged through various media such as gestures, hand movements, and facial expressions ("Intelligent Multimedia Interfaces", Maybury, M. T. (Ed.), The AAAI Press/The MIT Press, 1993). For this reason, the multimodal interface is increasingly expected to be a promising way to realize a natural, easy-to-use human interface.

[0005]

[Problems to Be Solved by the Invention] Conventionally, however, the analysis accuracy of input from each medium has been low, and the properties of the individual input/output media have not been clarified. As a result, no multimodal interface has been realized that makes efficient use of each newly available input/output medium, or of several such media together, and that is efficient, effective, and reduces the burden on the user.

[0006] In particular, when inputs from separate recognition devices, such as image and speech recognizers, were integrated, the integration assumed in advance both the reliability of the information sent from each predetermined recognition means and a predetermined order of inputs. Consequently, when a change in the surrounding environment degraded the accuracy of one recognition device, no corresponding input interpretation was performed, and the interpretation unit stalled or easily malfunctioned. Nor could such systems handle an input order peculiar to an individual user: the user had to learn the input methods that the system could accept, which greatly reduced convenience.

[0007] An object of the present invention is therefore to provide an interface device for such a multimodal input/output device, and a method therefor, that allows the user freer input and that can adapt to changes in the surrounding environment.

[0008]

[Means for Solving the Problems] The invention of claim 1 is an interface device for a multimodal input/output device in which various operation targets with predetermined processing contents are prepared in advance; the user's intention behind a selection instruction for these operation targets is detected from information obtained by recognizing at least one of the user's gaze input information, voice input information, operation input information, image input information, and motion input information; and a required presentation is returned to the user, based on the detection result, to indicate that the selection instruction for the corresponding operation target has been recognized. The device is characterized in that the intention is detected by integrating means that uses causal relationship information between pieces of input information and that learns the causal relationship information statistically through actual input and output with the user.

[0009] The invention of claim 2 is the interface device for a multimodal input/output device according to claim 1, characterized in that the integrating means has time management means for managing time and holding means for holding input information at the current and past times, and the causal relationship information uses the current and past input information held by the holding means.

[0010] The invention of claim 3 is the interface device for a multimodal input/output device according to claim 1 or 2, characterized in that the integrating means statistically learns causal relationship information between pieces of input information, or between input information and input time, through actual input and output with the user.

[0011] The invention of claim 4 is an interface device for a multimodal input/output device in which various operation targets with predetermined processing contents are prepared in advance; the user's intention behind a selection instruction for these operation targets is detected from information obtained by recognizing at least one of the user's gaze input information, voice input information, operation input information, image input information, and motion input information; and a required presentation is returned to the user, based on the detection result, to indicate that the selection instruction for the corresponding operation target has been recognized. The device is characterized in that integrating means is provided independently for each operation target.

[0012] The invention of claim 5 is the interface device for a multimodal input/output device according to claim 3, characterized in that the integrating means determines the intention detection result for a specific operation target based on its positional, shape, color, and linguistic relationships with the other operation targets.

[0013] The invention of claim 6 is the interface device for a multimodal input/output device according to claim 3, characterized in that, for a specific operation target, the integrating means changes the positional, shape, color, and linguistic relationships with the other operation targets based on its own intention detection result and the intention detection results of the other operation targets.

[0014] The invention of claim 7 is the interface device for a multimodal input/output device according to claim 1 or 2, characterized in that the integrating means obtains intention information from intention information at a past time or from the result of the required presentation returned to the user.

[0015] The invention of claim 8 is the interface device for a multimodal input/output device according to claim 1, characterized in that learning of the causal relationship information is started or ended based on the intention information obtained from the integrating means.

[0016] The invention of claim 9 is the interface device for a multimodal input/output device according to claim 1, characterized in that the output by which the operation target returns the required presentation to the user is performed during the learning.

[0017] The invention of claim 10 is the interface device for a multimodal input/output device according to claim 1, characterized in that the integrating means uses at least one of the various kinds of input information as information for confirming or canceling the intention detection result.

[0018] The invention of claim 11 is an interface method for a multimodal input/output device in which various operation targets with predetermined processing contents are prepared in advance; the user's intention behind a selection instruction for these operation targets is detected from information obtained by recognizing at least one of the user's gaze input information, voice input information, operation input information, image input information, and motion input information; and a required presentation is returned to the user, based on the detection result, to indicate that the selection instruction for the corresponding operation target has been recognized. The method is characterized in that the intention is detected using causal relationship information between pieces of input information, and the causal relationship information is learned statistically through actual input and output with the user.

[0019] The invention of claim 12 is the interface method for a multimodal input/output device according to claim 7, characterized in that the learning statistically learns causal relationship information between pieces of input information, or between input information and input time, through actual input and output with the user.

[0020]

[Embodiments of the Invention] Embodiments of the present invention are described below with reference to the drawings.

[0021] First Embodiment
FIG. 1 is an overall block diagram of the system according to the first embodiment of the present invention.

[0022] For each operation target, this system has a gaze detection engine 101 based on image input, a speech recognition engine 102 based on voice input, an operation input unit 103 comprising a mouse, a keyboard and the like, an input integration unit 104 that integrates the inputs from 101 to 103 and detects the user's intention, and a feedback generation unit 105 that produces output to the user based on the intention detection result.

[0023] This embodiment targets a window system, and each operation target is one of the icons shown in FIG. 2. In this embodiment, "intention" means the user's intention to select an operation target (icon).

[0024] (Input units) The input units 101 to 103 send the input integration unit 104 the operation target together with the input information converted into a similarity.

[0025] For example, the gaze detection engine 101 analyzes the input image of the user's face, expresses the degree to which the gaze is directed at its own icon as a similarity between 0 and 1, and sends it to the input integration unit 104.

[0026] Similarly, the speech recognition engine 102 expresses the degree to which the utterance refers to its own icon (by the icon's name, shape, color, position, and so on) as a similarity between 0 and 1, based on how closely the input speech matches the vocabulary set it holds, and sends it to the input integration unit 104. For operation information, a similarity of 1 is sent to the input integration unit 104 when the operation is performed (the icon is selected), and a similarity of 0 otherwise.
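
As a rough illustration of the messages that the input units send to the input integration unit 104, the following Python sketch models one similarity report. The names `SimilarityReport` and `operation_similarity` and the modality labels are hypothetical and do not appear in the patent; only the 0-to-1 range and the 1/0 rule for operation input come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimilarityReport:
    """One message from an input unit (101-103) to the input integration unit 104."""
    target: str        # operation target (icon) the value refers to
    modality: str      # "gaze", "voice", or "operation" (labels assumed here)
    similarity: float  # degree of match, between 0 and 1

    def __post_init__(self):
        if not 0.0 <= self.similarity <= 1.0:
            raise ValueError("similarity must lie between 0 and 1")


def operation_similarity(target: str, selected: bool) -> SimilarityReport:
    """Operation input: similarity 1 when the icon was selected, 0 otherwise."""
    return SimilarityReport(target, "operation", 1.0 if selected else 0.0)
```

A gaze or speech engine would build the same kind of report with a graded value instead of the hard 1 or 0.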

[0027] The gaze detection engine 101 expresses the degree to which the gaze is directed at each operation target as similarity information. It can do so using, for example, an eye tracker that observes the user's eye movements, a head tracker that detects the movement of the user's head, a seating sensor, or a method such as the one used in Japanese Patent Application Laid-Open No. 08-059071, "Gaze point estimation device and method", which detects the user's gaze direction by processing image information obtained from a camera observing the user or a camera worn by the user.

[0028] The speech recognition engine 102 has means that takes the user's voice information as input and outputs its degree of match with the vocabulary to be recognized. The method described for "TOSBURG-II" (Trans. IEICE, Vol. J77-D-II, No. 8, pp. 1417-1428, 1994) can be used for this, for example. From this match information, the similarity for a selection target can be obtained, for example, by the following equation.

[0029]

[Math. 1]

By defining, for each icon that can be selected, vocabulary indicating its name, shape, color, and position, the similarity information can be obtained using Equation 1 above.

[0030] (Operation input) For the operation input 103, the target to which an event from the keyboard or mouse is delivered is sent to the input integration unit 104 as an input with a similarity of 1 for that operation target.

[0031] (Input integration unit) Next, the operation of the input integration unit 104 is described.

[0032] From each piece of input information, the input integration unit 104 obtains the probability that its own operation target is being selected. Through learning, it also adapts to the user and the environment so that the selection intention is detected more accurately.

[0033] This can be done, for example, with the following technique.

[0034] As shown in FIG. 3, the causal relationships between pieces of input information are represented using tables.

[0035] In FIG. 3, the causal relationships among the user's selection intention, the similarity information obtained from the gaze detection engine (gaze similarity), and the similarity information obtained from the speech recognition engine (voice similarity) are held as frequency information.

[0036] In this embodiment, each similarity is divided into equal intervals between 0 and 1 (although this is not essential).

[0037] FIG. 3 represents the relationship between the two under a model in which the selection intention is the cause of the gaze similarity.

[0038] The case in which there is a selection intention is called Positive and the case in which there is none is called Negative, and the distribution of the gaze similarity in each case is represented as frequency information.

[0039] FIG. 4 represents the relationship among the three quantities under a model in which the gaze similarity and the selection intention are the causes of the voice similarity. It gives the distribution of the voice similarity when the selection intention and the gaze similarity are given.
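
One possible in-memory layout for the frequency tables of FIGS. 3 and 4 is sketched below in Python. The ten bins of width 0.1 follow the intervals used in the learning example of this embodiment (0.6 to 0.7, 0.8 to 0.9); the names `bin_index`, `new_fig3_table`, and `new_fig4_table` are illustrative assumptions, not terms from the patent.

```python
N_BINS = 10  # similarity range 0..1 split into equal intervals of width 0.1
INTENTS = ("Positive", "Negative")


def bin_index(similarity: float) -> int:
    """Map a similarity between 0 and 1 to its interval; 1.0 goes in the last bin."""
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must lie between 0 and 1")
    return min(int(similarity * N_BINS), N_BINS - 1)


def new_fig3_table():
    """FIG. 3 style table: counts[intent][gaze_bin]."""
    return {intent: [0] * N_BINS for intent in INTENTS}


def new_fig4_table():
    """FIG. 4 style table: counts[intent][gaze_bin][voice_bin]."""
    return {intent: [[0] * N_BINS for _ in range(N_BINS)] for intent in INTENTS}
```

Keeping raw counts rather than normalized probabilities means that learning can be a simple increment, with normalization deferred to the moment a probability is needed.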

[0040] At fixed time intervals, the input integration unit 104 receives each piece of input information and uses the table of FIG. 3 to obtain the probability that there is a selection intention. Equation 2 below is used for this purpose, where G and S denote the gaze similarity and the voice similarity obtained from the input information.

[0041]

[Math. 2]

Since the values of G and S in Equation 2 above are obtained from the input information, the selection probability can be obtained using the table of FIG. 3. For example, when the gaze similarity is 0.85 and the voice similarity is 0.73, the result is

[Math. 3]

and the probability of the selection intention is found to be 1.
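
The equation itself does not survive in this text, so the following Python sketch is only one plausible reading. It assumes that the selection probability is the Bayesian posterior P(Positive | G, S) under the causal model of FIGS. 3 and 4, with every probability estimated from frequency counts. The function name, the table layout (`fig3[intent][gaze_bin]`, `fig4[intent][gaze_bin][voice_bin]`), the bin width of 0.1, and the fallback value for empty tables are all assumptions.

```python
N_BINS = 10
INTENTS = ("Positive", "Negative")


def bin_index(similarity: float) -> int:
    """Map a similarity between 0 and 1 to one of N_BINS equal intervals."""
    return min(int(similarity * N_BINS), N_BINS - 1)


def selection_probability(fig3, fig4, gaze, voice, fallback=0.5):
    """Estimate P(Positive | G, S) from count tables (assumed Bayes formulation)."""
    g, s = bin_index(gaze), bin_index(voice)
    total = sum(sum(fig3[intent]) for intent in INTENTS)
    joint = {}
    for intent in INTENTS:
        n_intent = sum(fig3[intent])    # observations with this intention
        n_voice = sum(fig4[intent][g])  # observations with this intention and gaze bin
        if total == 0 or n_intent == 0 or n_voice == 0:
            joint[intent] = 0.0
            continue
        prior = n_intent / total                # P(intent)
        p_gaze = fig3[intent][g] / n_intent     # P(G | intent)
        p_voice = fig4[intent][g][s] / n_voice  # P(S | G, intent)
        joint[intent] = prior * p_gaze * p_voice
    evidence = joint["Positive"] + joint["Negative"]
    # With no evidence for either case, return the neutral fallback value.
    return joint["Positive"] / evidence if evidence > 0 else fallback
```

When the matching cell holds only Positive observations, this estimate returns 1, which is at least consistent with the result stated for the example with gaze similarity 0.85 and voice similarity 0.73.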

[0042] Next, the method of learning the causal relationship information is described.

[0043] When the user gives operation information with a similarity of 1 using the keyboard, mouse or the like, the input integration unit 104 immediately sets the probability of selection intention for that operation target to 1 and updates the similarity distribution information of FIGS. 3 and 4 based on the gaze similarity and voice similarity at that moment.

[0044] That is, 1 is added to the corresponding entries in FIGS. 3 and 4.

[0045] For example, if the gaze similarity and voice similarity when a selection is made with the mouse are 0.65 and 0.87, respectively, 1 is added in FIG. 3 to the cell for gaze similarity = 0.6 to 0.7, selection = Positive, and in FIG. 4 to the cell for gaze similarity = 0.6 to 0.7, selection = Positive, voice similarity = 0.8 to 0.9.

[0046] For the operation targets that were not given a similarity of 1 at that moment, the probability of selection intention is immediately set to 0, and the similarity distribution information of FIGS. 3 and 4 is updated in the same way based on the gaze similarity and voice similarity at that moment, except that 1 is added to the selection = Negative cells for the corresponding gaze similarity and voice similarity.

[0047] In this way, learning can proceed dynamically, adapting to the user and the environment so that the detected selection intention becomes more reliable.
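
The update rule described above can be sketched in Python as follows. This is a minimal sketch that assumes one pair of count tables per operation target, ten bins of width 0.1, and hypothetical names (`new_tables`, `learn_from_operation`); the patent does not prescribe any of these.

```python
N_BINS = 10
INTENTS = ("Positive", "Negative")


def bin_index(similarity: float) -> int:
    """Map a similarity between 0 and 1 to one of N_BINS equal intervals."""
    return min(int(similarity * N_BINS), N_BINS - 1)


def new_tables():
    """Return empty (FIG. 3 style, FIG. 4 style) count tables for one target."""
    fig3 = {i: [0] * N_BINS for i in INTENTS}
    fig4 = {i: [[0] * N_BINS for _ in range(N_BINS)] for i in INTENTS}
    return fig3, fig4


def learn_from_operation(tables, selected_target, observations):
    """Update every target's tables after an explicit mouse/keyboard selection.

    tables:       {target: (fig3, fig4)}
    observations: {target: (gaze_similarity, voice_similarity)} at that moment
    Returns the selection probability forced for each target (1 or 0).
    """
    probabilities = {}
    for target, (gaze, voice) in observations.items():
        intent = "Positive" if target == selected_target else "Negative"
        fig3, fig4 = tables[target]
        g, s = bin_index(gaze), bin_index(voice)
        fig3[intent][g] += 1      # add 1 to the matching FIG. 3 cell
        fig4[intent][g][s] += 1   # add 1 to the matching FIG. 4 cell
        probabilities[target] = 1.0 if intent == "Positive" else 0.0
    return probabilities
```

With the figures from the example (gaze 0.65, voice 0.87 for the selected icon), the increment lands in gaze bin 6 and voice bin 8 of the Positive tables.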

[0048] (Feedback generation unit) Next, the operation of the feedback generation unit 105 is described.

[0049] The feedback generation unit 105 determines the feedback that the icon gives the user based on the selection intention probability sent from the input integration unit 104. This can be done, for example, by referring to the utility table shown in FIG. 5.

[0050] This embodiment assumes two levels of feedback (more levels may be assumed), and FIG. 5 lists the utility value of each feedback action.

[0051] Based on this table, the expected utility of each feedback action is calculated. With x denoting the selection intention probability obtained from the input integration unit 104, the expected utility is given by Equation 3 below.

[0052]

[Math. 4]

Based on Equation 3 above, the feedback action n with the largest expected utility is found and executed. For example, when the selection intention probability is 0.6, the expected utility of each feedback action is calculated from the table of FIG. 5 and Equation 3 as follows.

[0053]
Expected utility (selection feedback) = 0.6 × 1.0 + (1 - 0.6) × 0 = 0.6
Expected utility (selection candidate feedback) = 0.6 × 0.7 + (1 - 0.6) × 0.6 = 0.66
Expected utility (no feedback) = 0.6 × 0 + (1 - 0.6) × 1.0 = 0.4
In this case, the selection candidate feedback, which has the largest expected utility, is chosen.
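
The three calculations above all have the form x · U(n, Positive) + (1 - x) · U(n, Negative). The Python sketch below applies that form; the utility values are read off the worked example rather than from FIG. 5 itself, and the names `UTILITY`, `expected_utility`, and `choose_feedback` are illustrative.

```python
# (utility if the user intends to select, utility if not), per feedback action.
# Values inferred from the worked example for a selection probability of 0.6.
UTILITY = {
    "selection feedback": (1.0, 0.0),
    "selection candidate feedback": (0.7, 0.6),
    "no feedback": (0.0, 1.0),
}


def expected_utility(action: str, x: float) -> float:
    """Expected utility of a feedback action given selection probability x."""
    u_positive, u_negative = UTILITY[action]
    return x * u_positive + (1.0 - x) * u_negative


def choose_feedback(x: float) -> str:
    """Return the feedback action with the largest expected utility."""
    return max(UTILITY, key=lambda action: expected_utility(action, x))
```

With these values, a probability near 1 favors selection feedback and a probability near 0 favors no feedback, with the candidate feedback covering the uncertain middle range.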

[0054] The feedback actions shown in FIG. 5 comprise selection feedback and selection candidate feedback. The actual feedback is realized by changing the brightness, size, or shape of the icon in the window system, or by voice output.

[0055] Furthermore, before generating feedback, the feedback generation unit 105 can send prediction information to the gaze detection engine and the speech recognition engine. The prediction information in this case is, for example, that when the selection target gives selection feedback the user's gaze will turn toward the selection target, or that the user will refer to the selection target (by name, location, and so on). Based on the prediction information, each input unit switches, for example, the processing performed or the data set used in its recognition process.

[0056] In a system configured in this way, the causal relationships between modalities such as gaze and speech are learned, and intention is detected on that basis. Each input unit also performs processing that matches the user's predicted behavior. An interface that adapts dynamically to the user and the environment can thus be constructed easily.

[0057] (Modification 1) In this embodiment, the selection of an icon is taken as the user's intention, but the invention is not limited to this. Selection of any kind of target and execution of commands can be realized in the same way by adopting the system configuration of FIG. 1 for each of them.

[0058] (Modification 2) In FIGS. 3 and 4, tables are used for the similarity distribution information, but the invention is not limited to this. A distribution with continuous values, such as one given by a functional expression, may be assumed instead.

[0059] (Modification 3) In this embodiment, learning of the causal relationship information is realized by updating the similarity distribution information shown in FIGS. 3 and 4, but the invention is not limited to this. A separate table or other means may be used to store the learning results.

[0060] In that case, the selection intention is calculated based on both the original similarity distribution information and the similarity distribution information obtained through learning.

[0061] This can be realized, for example, by calculating the selection intention probability from the result of adding the corresponding entries of the two sets of similarity distribution information. Such learning results may be stored for each user.
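
A minimal Python sketch of this modification, assuming the distribution information is stored as nested dictionaries and lists of counts, is shown below. The function name `combine_counts` and the idea of keying learned tables by a user identifier are assumptions made for illustration.

```python
def combine_counts(base, learned):
    """Add two count tables of identical shape, cell by cell.

    Works on nested dicts and lists of numbers and returns a new table,
    leaving both inputs unchanged.
    """
    if isinstance(base, dict):
        return {key: combine_counts(base[key], learned[key]) for key in base}
    if isinstance(base, list):
        if len(base) != len(learned):
            raise ValueError("tables must have the same shape")
        return [combine_counts(b, l) for b, l in zip(base, learned)]
    return base + learned


# Hypothetical per-user storage of learned tables, kept apart from the base table.
learned_by_user = {
    "user_1": {"Positive": [0, 3], "Negative": [1, 0]},
}
```

The selection intention probability would then be computed from `combine_counts(base_table, learned_by_user[user])` instead of from a single table.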

【００６２】（変更例４）また、本実施例では入力情報
として視線検出、音声認識、操作入力（マウス・キーボ
ードによる）を用いているが、必ずしもこれに限るもの
ではない。(Fourth Modification) In this embodiment, visual line detection, voice recognition, and operation input (using a mouse / keyboard) are used as input information. However, the present invention is not limited to this.

【００６３】それ以外の入力情報についても図３，４に
示すような入力情報間の因果関係に関するテーブルを構
成することにより処理を行なうことが可能である。Other kinds of input information can also be handled by constructing tables for the causal relationship between the pieces of input information, as shown in FIGS. 3 and 4.

【００６４】（変更例５）また、本実施例においては、
フィードバック方法決定のために期待効用最大の原則を
用いているが、必ずしもこれに限るものではない。マク
シミン基準などの他の決定規則を用いても良い。(Modification 5) In this embodiment, the principle of maximum expected utility is used to determine the feedback method, but this is not a limitation. Other decision rules, such as the maximin criterion, may be used.
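
To make the difference between the two decision rules concrete, here is a small Python sketch. The action names and utility values are hypothetical stand-ins (the utility table of FIG. 5 and Equation 2 are not reproduced in this text); only the two rules themselves are standard.

```python
# Hypothetical utility table: utility[action][true state of the user's intention].
UTILITY = {
    "select_feedback": {"Positive": 1.0, "Negative": -1.0},
    "no_feedback": {"Positive": -0.5, "Negative": 0.2},
}


def max_expected_utility(p_positive, utility=UTILITY):
    """Pick the action whose probability-weighted utility is largest."""
    def expected(action):
        u = utility[action]
        return p_positive * u["Positive"] + (1.0 - p_positive) * u["Negative"]
    return max(utility, key=expected)


def maximin(utility=UTILITY):
    """Pick the action whose worst-case utility is largest (ignores probabilities)."""
    return max(utility, key=lambda action: min(utility[action].values()))


print(max_expected_utility(0.9))
print(max_expected_utility(0.1))
print(maximin())
```

With these assumed values, expected utility follows the estimated probability, while maximin always chooses the cautious action because it looks only at the worst outcome of each action.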

【００６５】（変更例６）また、本実施例では、入力統
合部104 は現在の時刻の入力情報を用いることとしてい
るが、必ずしもこれに限るものではない。(Modification 6) In this embodiment, the input integration unit 104 uses the input information of the current time. However, the present invention is not limited to this.

【００６６】過去の時刻における入力情報を用いてもよ
い。その場合は過去の時刻における視線類似度・音声類
似度を保持しておき、図３，４のテーブルにおいて保持
していた過去の時刻の類似度を採用すればよい。また現
在の時刻と過去の時刻の類似度間の因果関係を図３，４
のテーブル状に表現することもできる。Input information from past times may also be used. In that case, the gaze similarity and voice similarity at past times are retained, and the retained past similarities are used when looking up the tables of FIGS. 3 and 4. The causal relationship between the similarities at the current time and at past times can also be expressed in table form, as in FIGS. 3 and 4.

【００６７】第２の実施例次に第２の実施例につき説明する。Second Embodiment Next, a second embodiment will be described.

【００６８】図６は、第２の実施例のシステムの全体ブ
ロック図である。FIG. 6 is an overall block diagram of the system according to the second embodiment.

【００６９】このシステムのうちの第１の操作対象は、
画像入力に基づく視線検出エンジン5001、音声入力に基
づく音声認識エンジン5002、マウス・キーボード等から
なる操作入力部5003、前記5001〜5003よりの入力を統合
し、利用者の意図を検出する入力統合部5004、意図検出
結果に基づき利用者に出力を行なうフィードバック生成
部5005を持つ。The first operation target of this system has a gaze detection engine 5001 based on image input, a voice recognition engine 5002 based on voice input, an operation input unit 5003 comprising a mouse, keyboard, and the like, an input integration unit 5004 that integrates the inputs from 5001 to 5003 and detects the user's intention, and a feedback generation unit 5005 that outputs to the user based on the intention detection result.

【００７０】第２以降の操作対象は、それぞれ5101〜51
05、5201〜5205のように同様のユニットを持つ。各操作
対象の入力統合部・フィードバック生成部どうしは結合
されており、情報の交換を行なうことが可能である。The second and subsequent operation targets have similar units, numbered 5101 to 5105, 5201 to 5205, and so on. The input integration units and feedback generation units of the operation targets are connected to one another and can exchange information.

【００７１】本実施例では、ウインドウシステムを対象
とし、操作対象は図２に示すアイコンの一つ一つとす
る。また本実施例でいう意図とは利用者の操作対象に対
する選択意図とする。This embodiment targets a window system, and the operation targets are the individual icons shown in FIG. 2. In this embodiment, "intention" means the user's intention to select an operation target.

【００７２】（入力部）入力部5001〜5003，5101〜5103
等は、入力統合部5004、5104等に対し、操作対象と入力
情報を類似度に換算した情報を第１の実施例と同様な形
態で送るものとする。(Input unit) The input units 5001 to 5003, 5101 to 5103, and so on send to the input integration units 5004, 5104, and so on the operation target together with the input information converted into a similarity, in the same form as in the first embodiment.

【００７３】（入力統合部）入力統合部5004、5104等で
は、各入力情報より、自らが選択されている確率を求め
る。また学習により、利用者や環境に適応してより確度
の高い意図選択を行なう。(Input Integration Unit) Each of the input integration units 5004, 5104, and so on determines, from the input information, the probability that its own operation target is being selected. Through learning, each unit also adapts to the user and the environment so that the intention is detected more accurately.

【００７４】これは例えば以下の手法を用いることがで
きる。For this, the following method can be used, for example.

【００７５】図７〜１０に示すように、入力情報間の因
果関係をテーブルを用いて表現する。As shown in FIGS. 7 to 10, a causal relationship between input information is expressed using a table.

【００７６】図７〜１０では利用者の選択意図と、視線
検出エンジンから得られる類似度情報（視線類似度）と
音声認識エンジンから得られる類似度情報（音声類似
度）と各アイコン間の平均距離の因果関係を頻度情報と
して保持している。本実施例では、類似度はそれぞれ0
〜1 の間を当分割している（必ずしもこのようにする必
要はない）。FIGS. 7 to 10 hold, as frequency information, the causal relationship among the user's selection intention, the similarity information obtained from the gaze detection engine (gaze similarity), the similarity information obtained from the voice recognition engine (voice similarity), and the average distance between icons. In this embodiment, the range of each similarity from 0 to 1 is divided into equal intervals (this is not mandatory).

【００７７】図７、図９では選択意図が視線類似度の原
因として表されるというモデルのもとに両者の関係を表
現している。選択意図のある場合をＰｏｓｉｔｉｖｅ，
ない場合をＮｅｇａｔｉｖｅと呼び、それぞれの場合の
視線類似度の分布を頻度情報として表現している。FIGS. 7 and 9 express the relationship between the two under a model in which the selection intention is the cause of the gaze similarity. The case with a selection intention is called Positive and the case without one is called Negative, and the distribution of gaze similarity in each case is expressed as frequency information.

【００７８】図８、図１１では視線類似度と選択意図が
音声類似度の原因となるというモデルのもとに３者の関
係を表現しており、選択意図、視線類似度が与えられた
場合の音声類似度の分布情報である。FIGS. 8 and 11 express the relationship among the three quantities under a model in which the gaze similarity and the selection intention are causes of the voice similarity; they give the distribution of voice similarity for a given selection intention and gaze similarity.

【００７９】また、図７、図８、図９、図１０ともにア
イコン間平均距離がその原因となるというモデルのも
と、それぞれ各５、１０ピクセルの場合について分布情
報が与えられている。In addition, FIGS. 7, 8, 9, and 10 all assume a model in which the average distance between icons is also a cause, and distribution information is given for the cases of 5 and 10 pixels, respectively.

【００８０】入力統合部では一定の時刻毎に各入力情報
を受けとる。また選択対象間の情報交換により選択対象
アイコン間の平均距離を求め、５、１０のうちの近い値
を採用する。The input integration unit receives the input information at fixed time intervals. It also obtains the average distance between the selection target icons by exchanging information among the selection targets, and adopts whichever of 5 and 10 is closer to that average.
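
The distance step above can be sketched in a few lines of Python. Treating "average distance between icons" as the mean Euclidean distance over all icon pairs is an assumption made for illustration, as are the function names; the text only states that the closer of 5 and 10 is adopted.

```python
from itertools import combinations
from math import dist


def average_icon_distance(positions):
    """Mean Euclidean distance over all pairs of icon positions (assumed definition)."""
    pairs = list(combinations(positions, 2))
    return sum(dist(a, b) for a, b in pairs) / len(pairs)


def nearest_supported_distance(average, supported=(5, 10)):
    """Snap the measured average to the closest distance that has a table."""
    return min(supported, key=lambda d: abs(d - average))


# Hypothetical icon positions in pixels; pairwise distances are 6, 8 and 10.
icons = [(0, 0), (6, 0), (0, 8)]
average = average_icon_distance(icons)
print(average, nearest_supported_distance(average))
```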

【００８１】これらの値より、図３，４の表を利用して
選択意図のある場合の確率を求める。そのために以下の
式４を用いる。ここでＧ，Ｓは式１と同様にそれぞれ入
力情報より得られる視線類似度、音声類似度の値を示
す。Ｄはアイコン間距離を示す。From these values, the probability that a selection intention exists is obtained using the tables of FIGS. 3 and 4. Equation 4 below is used for this purpose. Here, as in Equation 1, G and S denote the gaze similarity and voice similarity obtained from the input information, and D denotes the distance between icons.

【００８２】[0082]
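
Equation 4 itself is not reproduced in this text. As a hedged sketch of how a probability of this kind could be computed from frequency tables, the Python below applies Bayes' rule under the model described above (the selection intention is a cause of the gaze similarity, and both are causes of the voice similarity), using the table set for the adopted distance D. The table layout, the function name, and the fixed prior are assumptions; the patent's actual formula may differ.

```python
def posterior_positive(gaze_counts, voice_counts, g, s, prior_positive=0.5):
    """Estimate P(intention = Positive | G, S) from frequency tables.

    gaze_counts[intent][g]     : count of gaze bin g under that intent
    voice_counts[intent][g][s] : count of voice bin s under that intent and gaze bin
    Both tables are assumed to belong to the adopted inter-icon distance D.
    """
    def likelihood(intent):
        gaze_row = gaze_counts[intent]
        voice_row = voice_counts[intent][g]
        if sum(gaze_row) == 0 or sum(voice_row) == 0:
            return 0.0  # nothing observed yet for this case
        p_gaze = gaze_row[g] / sum(gaze_row)
        p_voice = voice_row[s] / sum(voice_row)
        return p_gaze * p_voice

    pos = likelihood("Positive") * prior_positive
    neg = likelihood("Negative") * (1.0 - prior_positive)
    if pos + neg == 0.0:
        return prior_positive  # no evidence either way
    return pos / (pos + neg)


# Tiny hypothetical tables with two bins per similarity.
gaze_counts = {"Positive": [1, 3], "Negative": [3, 1]}
voice_counts = {
    "Positive": [[1, 1], [1, 3]],
    "Negative": [[1, 1], [3, 1]],
}
print(posterior_positive(gaze_counts, voice_counts, g=1, s=1))
```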

【数５】次に因果関係情報の学習方式について説明する。(Equation 5) Next, a method of learning causal relationship information will be described.

【００８３】ユーザからキーボード・マウス等を用いて
類似度１の操作情報が与えられた場合、情報統合部はそ
の操作対象に対して選択意図の確率を直ちに１にすると
ともに、その際の視線類似度、音声類似度、アイコン間
距離の値をもとに、図７〜１０の類似度分布情報を更新
する。すなわち図７〜１０中の対応する項目に１を加え
る。例えばマウスにより選択が行なわれた際の視線類似
度、音声類似度をそれぞれ0.65,0.87 、アイコン間距離
を５ピクセルとすると、図７では視線類似度＝0.6 〜0.
7 、選択＝Ｐｏｓｉｔｉｖｅの欄に、図８では視線類似
度＝0.6 〜0.7 、選択＝Ｐｏｓｉｔｉｖｅ、音声類似度
＝0.8 〜0.9 の欄にそれぞれ１を加えることになる。When the user gives operation information with similarity 1 using the keyboard, mouse, or the like, the integration unit immediately sets the probability of selection intention for that operation target to 1 and updates the similarity distribution information of FIGS. 7 to 10 based on the gaze similarity, voice similarity, and inter-icon distance at that moment. That is, 1 is added to the corresponding entries in FIGS. 7 to 10. For example, if the gaze similarity and voice similarity when a selection is made with the mouse are 0.65 and 0.87, respectively, and the inter-icon distance is 5 pixels, then 1 is added in FIG. 7 to the entry for gaze similarity = 0.6 to 0.7, selection = Positive, and in FIG. 8 to the entry for gaze similarity = 0.6 to 0.7, selection = Positive, voice similarity = 0.8 to 0.9.
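
The update rule in this paragraph can be sketched as follows. The ten equal bins match the 0.6 to 0.7 and 0.8 to 0.9 ranges in the example; the dictionary-based tables, their key layout, and the function names are illustrative assumptions standing in for the tables of FIGS. 7 to 10.

```python
from collections import defaultdict

NUM_BINS = 10  # similarity range 0 to 1 split into equal intervals


def bin_index(similarity, num_bins=NUM_BINS):
    """Map a similarity in [0, 1] to its interval index; 1.0 falls in the last bin."""
    return min(int(similarity * num_bins), num_bins - 1)


# Assumed layouts: (distance, intent, gaze_bin) and (distance, intent, gaze_bin, voice_bin).
gaze_table = defaultdict(int)
voice_table = defaultdict(int)


def record_observation(distance, intent, gaze, voice):
    """Add 1 to the entries that correspond to one observed input."""
    g, s = bin_index(gaze), bin_index(voice)
    gaze_table[(distance, intent, g)] += 1
    voice_table[(distance, intent, g, s)] += 1


# The example from the text: mouse selection with gaze 0.65, voice 0.87, distance 5.
record_observation(5, "Positive", 0.65, 0.87)
print(dict(gaze_table), dict(voice_table))
```

Targets that did not receive similarity 1 at the same moment would be recorded with `intent="Negative"` in the same way.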

【００８４】またその際に類似度１が与えられなかった
操作対象に対しては、選択意図の確率を直ちに０にする
とともに、その際の視線類似度、音声類似度、アイコン
間距離の値をもとに、図３，４の類似度分布情報を同様
に更新する。ただし対応する視線類似度、音声類似度に
ついて、選択＝Ｎｅｇａｔｉｖｅの欄に１を加えること
になる。For operation targets that were not given similarity 1 at that moment, the probability of selection intention is immediately set to 0, and the similarity distribution information of FIGS. 3 and 4 is updated in the same way based on the gaze similarity, voice similarity, and inter-icon distance at that moment. In this case, however, 1 is added to the selection = Negative entries for the corresponding gaze similarity and voice similarity.

【００８５】このように、利用者・環境に適応して、選
択意図が確からしくなるように動的に学習を行なってい
くことが可能になる。In this way, the system can learn dynamically, adapting to the user and the environment, so that the detected selection intention becomes more reliable.

【００８６】（フィードバック生成部）次にフィードバ
ック生成部5005、5105等の動作について説明する。(Feedback Generation Unit) Next, the operation of the feedback generation units 5005 and 5105 will be described.

【００８７】フィードバック生成部では、入力統合部50
04、5104等から送られた選択意図確率に基づいて、選択
対象アイコンが利用者に対して行なうフィードバックを
決定する。これは第１の実施例と同様に、例えば図５に
示す効用テーブルと式２を用いて行なうことができる。The feedback generation unit determines the feedback that the selection target icon gives to the user, based on the selection intention probability sent from the input integration units 5004, 5104, and so on. As in the first embodiment, this can be done using, for example, the utility table shown in FIG. 5 and Equation 2.

【００８８】またフィードバック生成部では、求めた各
選択対象の期待効用値に基づいて、アイコン間距離の値
を変更し、利用者へのフィードバックとすることができ
る。本実施例では、アイコン間距離５、１０ピクセルの
場合のそれぞれを仮定して各選択対象の期待効用値を求
める。全選択対象について、５、１０のときの期待効用
値の平均をとり、それが大きい方が期待効用値の大きい
選択対象間関係であると認定する。その結果に基づきア
イコン間距離を変更する。この際各選択対象のフィード
バック生成部間同士で情報交換を行ない、アイコン間距
離を指定値に近い値に調整する。これには例えば制約充
足プログラミング技術（人工知能学会Vol.10,No.3 を参
照）を用いることができる。The feedback generation unit can also change the inter-icon distance based on the expected utility values obtained for the selection targets, and use this change as feedback to the user. In this embodiment, the expected utility value of each selection target is computed assuming inter-icon distances of 5 and 10 pixels in turn. For each of the two distances, the expected utility values are averaged over all selection targets, and the distance with the larger average is judged to be the relationship between selection targets with the higher expected utility. The inter-icon distance is changed according to this result. At this time, the feedback generation units of the selection targets exchange information with one another and adjust the inter-icon distance to a value close to the specified value. A constraint satisfaction programming technique (see Japanese Society for Artificial Intelligence, Vol. 10, No. 3) can be used for this, for example.
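
The comparison of the two candidate distances can be sketched as below. The per-icon expected utility values are hypothetical inputs (in the patent they would come from the utility table of FIG. 5), and the function name is an assumption; the sketch covers only the averaging and selection step, not the constraint-based layout adjustment.

```python
def best_distance(expected_utilities):
    """Return the candidate distance with the largest mean expected utility.

    expected_utilities maps each candidate distance to the list of expected
    utility values computed for every selection target under that distance.
    """
    averages = {
        distance: sum(values) / len(values)
        for distance, values in expected_utilities.items()
    }
    return max(averages, key=averages.get), averages


# Hypothetical expected utilities for two icons under 5 and 10 pixel spacing.
choice, averages = best_distance({5: [0.2, 0.4], 10: [0.5, 0.3]})
print(choice, averages)
```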

【００８９】さらにフィードバック生成部は、フィード
バック生成に先立ち視線検出エンジン・音声認識エンジ
ンに予測情報を送ることができる。この場合の予測情報
とは、選択対象が選択フィードバックの際には視線が選
択対象の方を向くというものであったり、また選択対象
に対する言及（名前・場所など）が行なわれるというも
のである。各入力部は予測情報に基づき、各認識処理中
での処理内容や処理用データセットを切替える等の処理
を行なう。Furthermore, the feedback generation unit can send prediction information to the gaze detection engine and the voice recognition engine before generating feedback. Prediction information here means, for example, that the gaze will turn toward the selection target when that target gives selection feedback, or that the user will refer to the selection target (by name, location, and so on). Based on the prediction information, each input unit switches the processing contents or the data set used in its recognition processing.

【００９０】（変更例１）なお、本実施例では、アイコ
ン間距離を選択対象間の関係として設定しているが、こ
れは一例であり、例えば選択対象間の形状的関係、色彩
的関係、言語的関係を設定してもよい。また選択対象間
の関係も今回の実施例のような離散値に限るものではな
く、関数式のように連続値をとるように設定しても良
い。(Modification 1) In this embodiment, the distance between icons is used as the relationship between selection targets, but this is only an example; a shape relationship, color relationship, or linguistic relationship between selection targets may be used instead. The relationship between selection targets is also not limited to discrete values as in this embodiment, and may be set to take continuous values, as with a functional expression.

【００９１】（変更例２）また、フィードバック生成部
は最適な選択対象間関係を求めるために全選択対象の期
待効用値の平均をとっているが、必ずしもこれに限るも
のではなく、部分的な選択対象間の期待効用値を利用し
ても良い。(Modification 2) The feedback generation unit averages the expected utility values of all selection targets to find the optimum relationship between selection targets, but this is not a limitation; the expected utility values of only a subset of the selection targets may be used.

【００９２】（変更例３）また、本実施例では、因果関
係情報の学習を図７〜１０に示す類似度分布情報の更新
により実現しているが、実際にはこれに限るものではな
く、学習結果保存用に別のテーブル等の手段を用いても
よい。(Modification 3) In the present embodiment, the learning of the causal relationship information is realized by updating the similarity distribution information shown in FIGS. 7 to 10, but is not limited to this. A means such as another table may be used for storing the learning result.

【００９３】その場合には従来の類似度分布情報と学習
結果により得られた類似度分布情報に基づいて選択意図
が計算される。In that case, the selection intention is calculated from both the original similarity distribution information and the similarity distribution information obtained through learning.

【００９４】これは例えば両者の類似度分布情報間で同
一の項目を加算した結果に基づき選択意図確率を計算す
ることにより実現することができる。このような学習結
果を、利用者ごと、または利用環境ごとに格納してもよ
い。このように本発明においては、その趣旨を逸脱しな
い範囲で種々の変形を行なうことが可能である。This can be realized, for example, by adding the matching entries of the two sets of similarity distribution information and calculating the selection intention probability from the sums. Such learning results may be stored for each user or for each usage environment. As described above, various modifications can be made to the present invention without departing from its spirit.

【００９５】第３の実施例次に第３の実施例について説明する。 Third Embodiment Next, a third embodiment will be described.

【００９６】全体ブロック図は第２の実施例と同様に図
６を用いる。図６は、システムの全体ブロック図であ
る。FIG. 6 is used for the entire block diagram similarly to the second embodiment. FIG. 6 is an overall block diagram of the system.

【００９７】このシステムのうちの第１の操作対象は、
画像入力に基づく視線検出エンジン5001、音声入力に基
づく音声認識エンジン5002、マウス・キーボード等から
なる操作入力部5003、前記5001〜5003よりの入力を統合
し、利用者の意図を検出する入力統合部5004、意図検出
結果に基づき利用者に出力を行なうフィードバック生成
部5005を持つ。The first operation target of this system has a gaze detection engine 5001 based on image input, a voice recognition engine 5002 based on voice input, an operation input unit 5003 comprising a mouse, keyboard, and the like, an input integration unit 5004 that integrates the inputs from 5001 to 5003 and detects the user's intention, and a feedback generation unit 5005 that outputs to the user based on the intention detection result.

【００９８】第２以降の操作対象は、それぞれ5101〜51
05、5201〜5205のように同様のユニットを持つ。The second and subsequent operation targets have similar units, numbered 5101 to 5105, 5201 to 5205, and so on.

【００９９】各操作対象の入力統合部・フィードバック
生成部どうしは結合されており、情報の交換を行なうこ
とが可能である。本実施例では、ウインドウシステムを
対象とし、操作対象は図２に示すアイコンの一つ一つと
する。また本実施例でいう意図とは利用者の操作対象に
対する選択意図とする。The input integration units and feedback generation units of the operation targets are connected to one another and can exchange information. This embodiment targets a window system, and the operation targets are the individual icons shown in FIG. 2. In this embodiment, "intention" means the user's intention to select an operation target.

【０１００】（入力部）入力部5001〜5003，5101〜5103
等は、入力統合部5004、5104等に対し、操作対象と入力
情報を類似度に換算した情報を第１の実施例と同様な形
態で送るものとする。(Input unit) The input units 5001 to 5003, 5101 to 5103, and so on send to the input integration units 5004, 5104, and so on the operation target together with the input information converted into a similarity, in the same form as in the first embodiment.

【０１０１】（入力統合部）入力統合部5004、5104等で
は、各入力情報より、自らが選択されている確率を求め
る。また学習により、利用者や環境に適応してより確度
の高い意図選択を行なう。(Input Integration Unit) Each of the input integration units 5004, 5104, and so on determines, from the input information, the probability that its own operation target is being selected. Through learning, each unit also adapts to the user and the environment so that the intention is detected more accurately.

【０１０２】これは例えば以下の手法を用いることがで
きる。第１の実施例と同様に、図３，４に示すように、
入力情報間の因果関係をテーブルを用いて表現する。For this, the following method can be used, for example. As in the first embodiment, the causal relationship between the pieces of input information is expressed using tables, as shown in FIGS. 3 and 4.

【０１０３】図３，４では利用者の選択意図と、視線検
出エンジンから得られる類似度情報（視線類似度）と音
声認識エンジンから得られる類似度情報（音声類似度）
との因果関係を頻度情報として保持している。本実施例
では、類似度はそれぞれ0 〜1 の間を当分割している
（必ずしもこのようにする必要はない）。FIGS. 3 and 4 hold, as frequency information, the causal relationship among the user's selection intention, the similarity information obtained from the gaze detection engine (gaze similarity), and the similarity information obtained from the voice recognition engine (voice similarity). In this embodiment, the range of each similarity from 0 to 1 is divided into equal intervals (this is not mandatory).

【０１０４】図３では選択意図が視線類似度の原因とし
て表されるというモデルのもとに両者の関係を表現して
いる。選択意図のある場合をＰｏｓｉｔｉｖｅ，ない場
合をＮｅｇａｔｉｖｅと呼び、それぞれの場合の視線類
似度の分布を頻度情報として表現している。In FIG. 3, the relationship between the two is expressed based on a model in which the selection intention is expressed as the cause of the gaze similarity. The case where there is a selection intention is called Positive, and the case where there is no intention of selection is called Negative, and the distribution of gaze similarity in each case is expressed as frequency information.

【０１０５】図４では視線類似度と選択意図が音声類似
度の原因となるというモデルのもとに３者の関係を表現
しており、選択意図、視線類似度が与えられた場合の音
声類似度の分布情報である。FIG. 4 expresses the relationship among the three quantities under a model in which the gaze similarity and the selection intention are causes of the voice similarity; it gives the distribution of voice similarity for a given selection intention and gaze similarity.

【０１０６】入力統合部では一定の時刻毎に各入力情報
を受けとる。また選択対象間の情報交換により選択対象
アイコン間の平均距離を求め、５、１０のうちの近い値
を採用する。これらの値より、図３，４の表を利用して
選択意図のある場合の確率を求める。そのために以下の
式４を用いる。ここでＧ，Ｓは式１と同様にそれぞれ入
力情報より得られる視線類似度、音声類似度の値を示
す。Ｄはアイコン間距離を示す。The input integration unit receives the input information at fixed time intervals. It also obtains the average distance between the selection target icons by exchanging information among the selection targets, and adopts whichever of 5 and 10 is closer to that average. From these values, the probability that a selection intention exists is obtained using the tables of FIGS. 3 and 4. Equation 4 below is used for this purpose. Here, as in Equation 1, G and S denote the gaze similarity and voice similarity obtained from the input information, and D denotes the distance between icons.

【０１０７】[0107]

【数６】ここでδは0 以上1 未満の実数、Ｐ_-1は１単位時間前に
得られた選択意図確率とする。式５を用いることによ
り、選択意図の確率が過去の選択意図を反映したものに
なり、よりスムーズな意図情報の検出が可能になる。(Equation 6) Here, δ is a real number greater than or equal to 0 and less than 1, and P ₋₁ is a selection intention probability obtained one unit time ago. By using Expression 5, the probability of the selection intention reflects the past selection intention, and it is possible to detect the intention information more smoothly.
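
Equation 5 is not reproduced in this text. One form that fits the description (a weight δ with 0 ≤ δ < 1 and the probability from one unit time earlier) is an exponentially weighted blend of the previous and current probabilities. The Python below sketches that assumed form; the function name and the exact weighting are illustrative and may not match the patent's formula.

```python
def smoothed_probability(p_current, p_previous, delta):
    """Blend the current estimate with the one from one unit time earlier.

    delta = 0 ignores the past entirely; values near 1 change slowly.
    """
    if not 0.0 <= delta < 1.0:
        raise ValueError("delta must satisfy 0 <= delta < 1")
    return delta * p_previous + (1.0 - delta) * p_current


# A sudden jump in the instantaneous estimate is damped by the history.
p = 0.0
for instantaneous in (1.0, 1.0, 1.0):
    p = smoothed_probability(instantaneous, p, delta=0.5)
    print(p)
```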

【０１０８】また、入力統合部5004、5104等において
は、第１の実施例と同様の学習を行う。その際に、得ら
れて意図情報の結果を学習開始、終了のトリガとする。
これはたとえば、式５により得られた選択意図確率に対
して閾値Ｘを設け、Ｐ（選択意図＝Ｐｏｓｉｔｉｖｅ｜
Ｇ，Ｓ，Ｄ）≧Ｘの場合に学習を開始し、Ｐ（選択意図
＝Ｐｏｓｉｔｉｖｅ｜Ｇ，Ｓ，Ｄ）＜Ｘの場合に選択を
終了するようにする。これにより、マウス・キーボード
などを用いた明示的な学習開始・終了の信号が得られな
い場合でも学習を行うことができ、より因果関係情報が
得やすくなるという利点がある。The input integration units 5004, 5104, and so on perform the same learning as in the first embodiment, using the obtained intention information as the trigger for starting and ending learning. For example, a threshold X is set for the selection intention probability obtained by Equation 5; learning starts when P(selection intention = Positive | G, S, D) ≥ X, and the selection ends when P(selection intention = Positive | G, S, D) < X. Learning can thus be performed even when no explicit start or end signal from the mouse, keyboard, or the like is available, which has the advantage of making causal relationship information easier to obtain.
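
The threshold trigger can be sketched as a small state machine. The class name, the event labels, and the choice to emit an event only when the state changes are assumptions made for illustration; the text specifies only the comparison against the threshold X.

```python
class LearningTrigger:
    """Turns a stream of selection intention probabilities into start/end events."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.active = False

    def update(self, p_positive):
        """Return "start", "end", or None for one new probability value."""
        if p_positive >= self.threshold and not self.active:
            self.active = True
            return "start"
        if p_positive < self.threshold and self.active:
            self.active = False
            return "end"
        return None


trigger = LearningTrigger(threshold=0.7)
events = [trigger.update(p) for p in (0.5, 0.8, 0.9, 0.6)]
print(events)
```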

【０１０９】（フィードバック生成部）またフィードバ
ック生成部5005、5105等において、第１の実施例と同様
の方法でフィードバックを決定するが、これを前記の入
力統合部における学習中にも並行して行うようにする。(Feedback Generation Unit) The feedback generation units 5005, 5105, and so on determine the feedback by the same method as in the first embodiment, but they also do so in parallel while the input integration unit is learning.

【０１１０】これは入力統合部より、学習中においても
式５に基づいて選択意図確率を求め、フィードバック生
成部に順次送り、フィードバック生成部は選択意図確率
を受取りしだいフィードバック決定・実現を行うことに
より実現可能である。これにより学習時においても利用
者は自らの意図が正しく学習結果に反映されているかど
うかを確認することができ、その後の操作をより円滑に
進めることができるという利点がある。This can be realized by having the input integration unit compute the selection intention probability from Equation 5 even during learning and send it to the feedback generation unit as it becomes available, and by having the feedback generation unit determine and carry out the feedback as soon as it receives the probability. As a result, even during learning the user can check whether his or her intention is correctly reflected in the learning result, which has the advantage that subsequent operations proceed more smoothly.

【０１１１】（変更例１）ここで入力統合部5004、5104
等において、特定の入力部より得られる情報のうち少な
くとも一つを意図検出結果の確認または取り消しに用い
てもよい。(Modification 1) The input integration units 5004, 5104, and so on may use at least one piece of the information obtained from a specific input unit to confirm or cancel the intention detection result.

【０１１２】これは、例えば以下の手順により実現する
ことが可能である。This can be realized, for example, by the following procedure.

【０１１３】１）図１１のようなテーブルを準備して
おき、入力情報到着時に確認類似度条件または取消類似
度条件に合致するかどうかを調べる。1) A table as shown in FIG. 11 is prepared, and when input information arrives, it is checked whether or not the confirmation similarity condition or the cancellation similarity condition is met.

【０１１４】ここで「マウス右」は操作対象上にマウス
カーソルが置かれた状態でマウスの右ボタンがクリック
されたならば類似度を１にセットされる入力情報、「マ
ウス左」は操作対象上にマウスカーソルが置かれた状態
でマウスの左ボタンがクリックされたならば類似度を１
にセットされる入力情報とする。Here, "mouse right" is input information whose similarity is set to 1 if the right mouse button is clicked while the mouse cursor is over the operation target, and "mouse left" is input information whose similarity is set to 1 if the left mouse button is clicked while the mouse cursor is over the operation target.

【０１１５】また確認音声は「はい」「ＯＫ」等の特定
の音声入力との認識結果のうち最大の類似度を入力情報
とするものとし、取消音声は「いいえ」「Ｎｏ」等の特
定の音声入力との認識結果のうち最大の類似度を入力情
報とするものとする。For the confirmation voice, the input information is the maximum similarity among the recognition results against specific voice inputs such as "Yes" and "OK"; for the cancellation voice, it is the maximum similarity among the recognition results against specific voice inputs such as "No".

【０１１６】２）上記の条件にマッチした場合にはフ
ィードバック生成部に確認または取消信号を送出する。2) If the above conditions are met, a confirmation or cancellation signal is sent to the feedback generator.

【０１１７】３）フィードバック生成部では、確認ま
たは取消信号に応じた処理を行う。3) The feedback generation section performs processing according to the confirmation or cancellation signal.

【０１１８】これは例えば確認信号を受け取った場合は
最大の期待効用を持つ操作対象を利用者の意図と判定し
選択フィードバックを行い、取消信号を受け取った場合
にはすべての操作対象に対して期待効用値を0 にセット
し、フィードバックを行わないようにすることによる実
現可能である。For example, when a confirmation signal is received, the operation target with the maximum expected utility is judged to be the user's intention and selection feedback is given; when a cancellation signal is received, the expected utility value is set to 0 for all operation targets and no feedback is given.
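
Steps 2) and 3) above can be sketched as a single handler. The signal names, the dictionary of expected utilities, and the return shape are hypothetical; the logic follows the description: confirmation selects the target with the largest expected utility, cancellation zeroes every expected utility and suppresses feedback.

```python
def handle_signal(signal, expected_utility):
    """Return (target to give selection feedback to, updated expected utilities).

    expected_utility maps each operation target to its expected utility value.
    The first element is None when no feedback should be given.
    """
    if signal == "confirm":
        target = max(expected_utility, key=expected_utility.get)
        return target, dict(expected_utility)
    if signal == "cancel":
        return None, {target: 0.0 for target in expected_utility}
    return None, dict(expected_utility)  # unknown signal: leave everything as is


utilities = {"icon_a": 0.2, "icon_b": 0.7, "icon_c": 0.4}
print(handle_signal("confirm", utilities))
print(handle_signal("cancel", utilities))
```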

【０１１９】上記の拡張により、ユーザが自らの意図を
直接的にシステム側の意図検出に反映させることが可能
となり、より利便性の高いインタフェースを構成するこ
とが可能となる。With the above-mentioned expansion, the user can directly reflect his / her own intention in the intention detection on the system side, and a more convenient interface can be configured.

【０１２０】（変更例２）なお、本実施例では入力統合
部における学習を現在時刻の類似度情報を用いて行って
いるが、必ずしもこれに限るものではなく、過去の類似
度情報を使用しても良い。それは入力統合部にバッファ
を設け、過去の類似度情報を蓄積しておくことにより可
能である。(Modification 2) In the present embodiment, the learning in the input integration unit is performed using the similarity information at the current time. However, the present invention is not limited to this, and the past similarity information is used. May be. This can be achieved by providing a buffer in the input integration unit and storing past similarity information.

【０１２１】ここで例えば入力部より得たマウス・キー
ボードの操作情報または入力統合部より得られる意図情
報に基づいて過去の時点における選択意図のあるなしを
判断し、蓄積した過去の類似度情報を図３，４に示す因
果関係テーブルに反映させることのより可能である。For example, whether a selection intention existed at a past point in time is judged from mouse or keyboard operation information obtained from the input unit, or from intention information obtained from the input integration unit, and the accumulated past similarity information is then reflected in the causal relationship tables shown in FIGS. 3 and 4.
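
A buffer of this kind might look like the following sketch. The fixed buffer length, the class and method names, and the table key layout are assumptions; the idea taken from the text is that past similarities are held until the intention for that period is known and are then added to the frequency table.

```python
from collections import Counter, deque


def to_bin(similarity, num_bins=10):
    """Map a similarity in [0, 1] to an equal-width interval index."""
    return min(int(similarity * num_bins), num_bins - 1)


class SimilarityBuffer:
    """Holds recent (gaze, voice) similarities until their label is known."""

    def __init__(self, maxlen=5):
        self.samples = deque(maxlen=maxlen)  # oldest samples drop out automatically

    def push(self, gaze, voice):
        self.samples.append((gaze, voice))

    def flush_into(self, table, intent):
        """Add every buffered sample to the table under the judged intent."""
        for gaze, voice in self.samples:
            table[(intent, to_bin(gaze), to_bin(voice))] += 1
        self.samples.clear()


table = Counter()
buffer = SimilarityBuffer(maxlen=2)
for gaze, voice in ((0.15, 0.25), (0.65, 0.87), (0.68, 0.81)):
    buffer.push(gaze, voice)
buffer.flush_into(table, "Positive")  # e.g. after a mouse click confirms the selection
print(table)
```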

【０１２２】このように本発明においては、その趣旨を
逸脱しない範囲で種々の変形を行なうことが可能であ
る。As described above, in the present invention, various modifications can be made without departing from the spirit of the present invention.

【０１２３】[0123]

【発明の効果】本発明によれば、利用者のマルチモーダ
ル入力に対して、利用者からの入力情報間の因果関係情
報の動的学習により、利用者の自由な入力を許し、また
環境が変化した場合にも利用者意図の検出が確からしく
なるようにシステムが適応することができる。また利用
対象間の関係を変化させることにより、より意図の検出
を確からしくするような環境を構成することができる。According to the present invention, dynamic learning of the causal relationship information between pieces of input information from the user allows the user to give multimodal input freely, and lets the system adapt so that the user's intention is detected reliably even when the environment changes. In addition, by changing the relationship between the objects being used, an environment can be configured in which the intention is detected more reliably.

[Brief description of the drawings]

【図１】本発明の第１の実施例のブロック図である。FIG. 1 is a block diagram of a first embodiment of the present invention.

【図２】第１、第２及び第３の実施例で用いるタスクの
一例である。FIG. 2 is an example of a task used in the first, second, and third embodiments.

【図３】第１及び第３の実施例で用いる入力情報統合用
テーブルの一例である。FIG. 3 is an example of an input information integration table used in the first and third embodiments.

【図４】第１及び第３の実施例で用いる入力情報統合用
テーブルの一例である。FIG. 4 is an example of an input information integration table used in the first and third embodiments.

【図５】第１、第２及び第３の実施例で用いるフィード
バック生成用テーブルの一例である。FIG. 5 is an example of a feedback generation table used in the first, second, and third embodiments.

【図６】第２及び第３の実施例のブロック図である。FIG. 6 is a block diagram of the second and third embodiments.

【図７】第２の実施例で用いる入力情報統合用テーブル
の一例である。FIG. 7 is an example of an input information integration table used in the second embodiment.

【図８】第２の実施例で用いる入力情報統合用テーブル
の一例である。FIG. 8 is an example of an input information integration table used in the second embodiment.

【図９】第２の実施例で用いる入力情報統合用テーブル
の一例である。FIG. 9 is an example of an input information integration table used in the second embodiment.

【図１０】第２の実施例で用いる入力情報統合用テーブ
ルの一例である。FIG. 10 is an example of an input information integration table used in the second embodiment.

【図１１】第３の実施例で用いる意図確認・取消処理に
用いるテーブルの一例である。FIG. 11 is an example of a table used for intention confirmation / cancellation processing used in the third embodiment.

[Explanation of symbols]

１０１視線検出エンジン１０２音声認識エンジン１０３操作入力部１０４入力統合部１０５フィードバック生成部
101 Gaze detection engine
102 Voice recognition engine
103 Operation input unit
104 Input integration unit
105 Feedback generation unit

Claims

[Claims]

1. An interface device for a multimodal input/output device, in which various operation targets with predetermined processing contents are prepared in advance, the user's intention is detected from the user's selection instruction for these operation targets on the basis of information obtained by recognizing at least one of the user's gaze input information, voice input information, operation input information, and image input information, and, based on the detection result, a required presentation is returned to the user to indicate that the selection instruction for the corresponding operation target has been recognized, wherein the intention is detected using causal relationship information between input information, and the causal relationship information is statistically learned through actual input/output with the user.

2. The interface device for a multimodal input/output device according to claim 1, wherein the integrating means has time managing means for managing time and holding means for holding input information at the present and past times, and the causal relationship information uses the current and past input information held by the holding means.

3. The interface device for a multimodal input/output device according to claim 1, wherein the integrating means statistically learns the causal relationship information between input information, or between input information and input time, through actual input/output with the user.

4. An interface device for a multimodal input/output device, in which various operation targets with predetermined processing contents are prepared in advance, the user's intention is detected from the user's selection instruction for these operation targets on the basis of information obtained by recognizing at least one of the user's gaze input information, voice input information, operation input information, and image input information, and, based on the detection result, a required presentation is returned to the user to indicate that the selection instruction for the corresponding operation target has been recognized, wherein a multimodal interface is provided independently for each operation target.

5. The interface device for a multimodal input/output device according to claim 3, wherein the integrating means determines the result of intention detection for a specific operation target based on its positional, shape, color, and linguistic relationships with the other operation targets.

6. The interface device for a multimodal input/output device according to claim 3, wherein the integrating means changes the positional, shape, color, and linguistic relationships between a specific operation target and the other operation targets based on that target's own intention detection result and the intention detection results of the other operation targets.

7. The interface device for a multimodal input/output device according to claim 1, wherein the integrating means obtains the intention information from intention information at a past time or from the result of the presentation returned to the user.

8. The interface device for a multimodal input/output device according to claim 1, wherein learning of the causal relationship information is started or ended based on intention information obtained from the integrating means.

9. The interface device for a multimodal input/output device according to claim 1, wherein the output by which the operation target returns the required presentation to the user is also performed during the learning.

10. The interface device for a multimodal input/output device according to claim 1, wherein the integrating means uses at least one of the various kinds of input information as information for confirming or canceling the intention detection result.

11. An interface method for a multimodal input/output device, in which various operation targets with predetermined processing contents are prepared in advance, the user's intention is detected from the user's selection instruction for these operation targets on the basis of information obtained by recognizing at least one of the user's gaze input information, voice input information, operation input information, and image input information, and, based on the detection result, a required presentation is returned to the user to indicate that the selection instruction for the corresponding operation target has been recognized, wherein the intention is detected using causal relationship information between input information, and the causal relationship information is statistically learned through actual input/output with the user.

12. The interface method for a multimodal input/output device according to claim 7, wherein the learning statistically learns the causal relationship information between input information, or between input information and input time, through actual input/output with the user.