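WO2021140642A1 - Line-of-sight estimation device, line-of-sight estimation method, model generation device, and model generation method - Google Patents
Line-of-sight estimation device, line-of-sight estimation method, model generation device, and model generation method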
- Publication number
- WO2021140642A1 (PCT/JP2020/000643)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- information
- learning
- line
- extractor
- sight
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/18—Eye characteristics, e.g. of the iris
- G06V40/193—Preprocessing; Feature extraction
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/70—Determining position or orientation of objects or cameras
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/09—Supervised learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/98—Detection or correction of errors, e.g. by rescanning the pattern or by human intervention; Evaluation of the quality of the acquired patterns
- G06V10/993—Evaluation of the quality of the acquired pattern
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/18—Eye characteristics, e.g. of the iris
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30196—Human being; Person
- G06T2207/30201—Face
Definitions
- the present invention relates to a line-of-sight estimation device, a line-of-sight estimation method, a model generation device, and a model generation method.
- The corneal reflection method is known as an example of a method for estimating the line-of-sight direction.
- In this method, a bright spot (Purkinje image) is generated on the cornea.
- The line of sight is estimated based on the positional relationship between the generated bright spot and the pupil.
- With this method, the line-of-sight direction can be estimated with high accuracy regardless of the orientation of the face or the like.
- it is difficult to estimate the line-of-sight direction unless bright spots can be generated on the cornea. Therefore, the range in which the line-of-sight direction can be estimated is limited.
- In addition, the estimation accuracy of the line-of-sight direction may deteriorate when the position of the head fluctuates.
- a method using the pupil shape is known.
- the shape of the eyeball is regarded as a sphere
- the contour of the pupil is regarded as a circle
- the apparent shape of the pupil becomes elliptical as the eyeball moves. That is, in this method, the pupil shape of the subject shown in the captured image is fitted, and the line-of-sight direction is estimated based on the inclination of the obtained pupil shape (ellipse) and the ratio of the major axis to the minor axis.
- Because the calculation method is simple, the processing cost for estimating the line-of-sight direction can be reduced and the processing can be sped up.
- However, if the pupil shape cannot be obtained accurately, the estimation accuracy of the line-of-sight direction may deteriorate. If the image of the pupil has a low resolution, for example because the head is far from the photographing device or because the performance of the photographing device is low, it is difficult to fit the pupil shape, and it may therefore be difficult to estimate the line-of-sight direction.
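The geometry behind the pupil-shape method can be sketched as follows. This is a minimal illustration, not the method of any cited document: it assumes the pupil contour is a circle seen by a distant camera, so that a rotation of theta away from the camera axis shortens one axis of the fitted ellipse by cos(theta). The function name and its parameters are hypothetical, and the ellipse itself is assumed to have been fitted beforehand.

```python
import math


def gaze_angle_from_pupil_ellipse(major_axis, minor_axis, tilt_rad):
    """Return (theta, lean_direction) for a fitted pupil ellipse.

    Assumes a circular pupil contour, so the minor/major axis ratio
    equals cos(theta), where theta is the rotation away from the camera axis.
    tilt_rad is the orientation of the major axis in the image plane.
    """
    if not 0 < minor_axis <= major_axis:
        raise ValueError("expected 0 < minor_axis <= major_axis")
    # The axis ratio gives the magnitude of the rotation.
    theta = math.acos(minor_axis / major_axis)
    # Foreshortening happens along the minor axis, which is perpendicular to
    # the major axis. Which way along that axis the eye leans stays ambiguous.
    lean_direction = (tilt_rad + math.pi / 2) % math.pi
    return theta, lean_direction


theta, lean = gaze_angle_from_pupil_ellipse(major_axis=10.0, minor_axis=5.0, tilt_rad=0.0)
print(math.degrees(theta), math.degrees(lean))
```

The sketch also shows why low resolution hurts: a small error in either axis length changes the ratio, and near a ratio of 1 the arccosine is very sensitive to that error.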
- Patent Document 1 proposes a method of estimating the line-of-sight direction using a trained model such as a neural network.
- In this method, a partial image showing the eyes is extracted from a captured image obtained by photographing the face of the subject, and the trained model is used to estimate the line-of-sight direction of the subject from the extracted partial image.
- the inventors of the present invention have found that the conventional method has the following problems. That is, it is known that the fovea centralis exists in the center of the human retina, and this fovea contributes to vision in a high-definition central visual field. Therefore, the human line-of-sight direction can be defined by the line connecting the fovea centralis and the center of the pupil. There are individual differences in the position of this fovea. That is, the fovea is not always located in the perfect center of the retina, and its position may vary depending on the individual. It is difficult to identify the position of the fovea centralis of each individual from the captured image obtained by the imaging device.
- In the conventional method, a model for estimating the line-of-sight direction is constructed based on data obtained from subjects.
- Consequently, the conventional method has the problem that the estimation accuracy of the line-of-sight direction may deteriorate because of individual differences in the position of the fovea.
- In one aspect, the present invention was made in view of such circumstances, and an object of the present invention is to provide a technique capable of estimating the line-of-sight direction of a target person with high accuracy.
- the present invention adopts the following configuration in order to solve the above-mentioned problems.
- The line-of-sight estimation device according to one aspect of the present invention includes an information acquisition unit, an image acquisition unit, an estimation unit, and an output unit.
- The information acquisition unit acquires calibration information including feature information regarding the line of sight of the eyes of the target person looking in a predetermined direction and true value information indicating the true value of the predetermined direction in which the target person's eyes are looking. The image acquisition unit acquires a target image in which the target person's eyes are captured.
- The estimation unit estimates the line-of-sight direction of the target person in the target image using a trained estimation model generated by machine learning. The model is trained to output, for the input of learning calibration information and a learning target image obtained from a subject, an output value that matches the correct answer information indicating the true value of the subject's line-of-sight direction in the learning target image.
- Estimating the line-of-sight direction consists of inputting the acquired target image and the calibration information into the trained estimation model and executing the arithmetic processing of the model, thereby acquiring from the model an output value corresponding to the result of estimating the line-of-sight direction of the target person in the target image. The output unit outputs information about the result of estimating the line-of-sight direction of the target person.
- The feature information relates to the line of sight of the eyes of the target person looking in a predetermined direction.
- The true value information indicates the true value of the predetermined direction. From the feature information and the true value information, it is possible to grasp the characteristics of the eyes forming a line of sight in a direction whose true value is known (that is, the individuality of the target person's line of sight). Therefore, by additionally using the calibration information when estimating the line-of-sight direction, this configuration can calibrate the difference in line-of-sight direction caused by individual differences between the subject and the target person. That is, the line-of-sight direction of the target person can be estimated in consideration of individual differences, which improves the accuracy of the estimation.
- the calibration information may include the feature information and the true value information corresponding to each of a plurality of different predetermined directions. According to this configuration, since the individuality of the line-of-sight of the target person can be grasped more accurately from the calibration information for a plurality of different directions, the accuracy of estimating the line-of-sight direction of the target person can be further improved.
- Including the feature information and the true value information may consist of including a calibration feature amount derived by combining the feature information and the true value information.
- The trained estimation model may include a first extractor and an estimator. Executing the arithmetic processing of the trained estimation model may consist of inputting the acquired target image to the first extractor and executing the arithmetic processing of the first extractor to acquire, from the first extractor, an output value corresponding to a first feature amount relating to the target image, and then inputting the calibration feature amount and the acquired first feature amount to the estimator and executing the arithmetic processing of the estimator.
- According to this configuration, it is possible to provide a trained estimation model capable of appropriately estimating the line-of-sight direction of the target person from the target image and the calibration information. Further, by reducing the amount of calibration information, the cost of the information processing for estimating the line-of-sight direction of the target person can be reduced, and the information processing can thereby be sped up.
- the feature information may be composed of a second feature amount relating to a reference image in which the eyes of the subject looking in the predetermined direction are captured.
- The information acquisition unit may have a coupler. Acquiring the calibration information may consist of acquiring the second feature amount, acquiring the true value information, and inputting the acquired second feature amount and the true value information to the coupler and executing the arithmetic processing of the coupler to acquire, from the coupler, an output value corresponding to the calibration feature amount. According to this configuration, the arithmetic processing for deriving the calibration feature amount is executed in the calibration information acquisition process, not in the line-of-sight direction estimation process. Therefore, the processing cost of the estimation process can be suppressed.
- In particular, when the acquisition of the target image and the estimation of the line-of-sight direction are executed repeatedly, a calibration feature amount that has already been derived can be reused in each repetition, so the calibration information acquisition process need not be executed again. Therefore, the cost of the series of arithmetic processing can be reduced, and the series of arithmetic processing can thereby be sped up.
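The reuse of the calibration feature amount across repeated estimations can be sketched as below. This is only an illustration of the data flow under stated assumptions: the class name, method names, and the toy extractor, coupler, and estimator functions are all hypothetical stand-ins, whereas in the described configuration these components would be trained models.

```python
class CalibratedGazeEstimator:
    """Wires extractors, a coupler, and an estimator; caches the calibration feature."""

    def __init__(self, first_extractor, second_extractor, coupler, estimator):
        self.first_extractor = first_extractor
        self.second_extractor = second_extractor
        self.coupler = coupler
        self.estimator = estimator
        self._calibration_feature = None

    def calibrate(self, reference_image, true_direction):
        # Run once per target person: reference image -> second feature amount,
        # then combine it with the true value into one calibration feature.
        second_feature = self.second_extractor(reference_image)
        self._calibration_feature = self.coupler(second_feature, true_direction)

    def estimate(self, target_image):
        if self._calibration_feature is None:
            raise RuntimeError("call calibrate() before estimate()")
        first_feature = self.first_extractor(target_image)
        # The cached calibration feature is reused for every target image.
        return self.estimator(first_feature, self._calibration_feature)


# Toy stand-ins (hypothetical): an "image" is a list of numbers and a
# "feature" is its mean. Real components would be trained networks.
coupler_calls = []


def mean(image):
    return sum(image) / len(image)


def toy_coupler(second_feature, true_direction):
    coupler_calls.append(1)
    return true_direction - second_feature  # person-specific offset


def toy_estimator(first_feature, calibration_feature):
    return first_feature + calibration_feature


model = CalibratedGazeEstimator(mean, mean, toy_coupler, toy_estimator)
model.calibrate(reference_image=[0.5, 1.5], true_direction=0.0)
estimates = [model.estimate(frame) for frame in ([1.0, 3.0], [4.0, 6.0])]
```

Here the coupler runs once in `calibrate`, while `estimate` runs per frame, which mirrors the cost argument made in the text.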
- the information acquisition unit may further include a second extractor.
- Acquiring the second feature amount may consist of acquiring the reference image, inputting the acquired reference image to the second extractor, and executing the arithmetic processing of the second extractor to acquire, from the second extractor, an output value corresponding to the second feature amount. According to this configuration, it is possible to appropriately acquire feature information (the second feature amount) representing the features of the line of sight of the target person looking in a predetermined direction.
- the trained estimation model may include a first extractor and an estimator.
- In this case, executing the arithmetic processing of the trained estimation model may consist of inputting the acquired target image to the first extractor and executing the arithmetic processing of the first extractor to acquire, from the first extractor, an output value corresponding to a first feature amount relating to the target image, and then inputting the feature information, the true value information, and the acquired first feature amount to the estimator and executing the arithmetic processing of the estimator.
- According to this configuration, it is possible to provide a trained estimation model capable of appropriately estimating the line-of-sight direction of the target person from the target image and the calibration information.
- the feature information may be composed of a second feature amount relating to a reference image in which the eyes of the subject looking in the predetermined direction are captured.
- the information acquisition unit may have a second extractor.
- In this case, acquiring the calibration information may consist of acquiring the reference image, inputting the acquired reference image to the second extractor and executing the arithmetic processing of the second extractor to acquire, from the second extractor, an output value corresponding to the second feature amount, and acquiring the true value information. According to this configuration, it is possible to appropriately acquire feature information (the second feature amount) representing the features of the line of sight of the target person looking in a predetermined direction.
- When the acquisition of the target image and the estimation of the line-of-sight direction are executed repeatedly, a second feature amount that has already been derived can be reused in each repetition, so the calibration information acquisition process need not be executed again. Therefore, the cost of the series of arithmetic processing for estimating the line-of-sight direction of the target person can be reduced, and the series of arithmetic processing can thereby be sped up.
- the feature information may be composed of a reference image in which the eyes of the subject looking in the predetermined direction are captured.
- the trained estimation model may include a first extractor, a second extractor, and an estimator.
- In this case, executing the arithmetic processing of the trained estimation model may consist of the following: inputting the acquired target image to the first extractor and executing the arithmetic processing of the first extractor to acquire, from the first extractor, an output value corresponding to a first feature amount relating to the target image; inputting the reference image to the second extractor and executing the arithmetic processing of the second extractor to acquire, from the second extractor, an output value corresponding to a second feature amount relating to the reference image; and inputting the acquired first feature amount, the acquired second feature amount, and the true value information to the estimator and executing the arithmetic processing of the estimator.
- According to this configuration, it is possible to provide a trained estimation model capable of appropriately estimating the line-of-sight direction of the target person from the target image and the calibration information.
- The trained estimation model may include a first converter and an estimator.
- In this case, executing the arithmetic processing of the trained estimation model may consist of inputting the acquired target image to the first converter and executing the arithmetic processing of the first converter to acquire, from the first converter, an output value corresponding to a first heat map relating to the line-of-sight direction of the target person, and then inputting the acquired first heat map, the feature information, and the true value information to the estimator and executing the arithmetic processing of the estimator.
- According to this configuration, it is possible to provide a trained estimation model capable of appropriately estimating the line-of-sight direction of the target person from the target image and the calibration information.
- The feature information may consist of a second heat map that is derived from a reference image in which the eyes of the target person looking in the predetermined direction are captured and that relates to the line-of-sight direction of the eyes looking in the predetermined direction.
- The information acquisition unit may have a second converter. Acquiring the calibration information may consist of acquiring the reference image, inputting the acquired reference image to the second converter and executing the arithmetic processing of the second converter to acquire, from the second converter, an output value corresponding to the second heat map, acquiring the true value information, and converting the true value information into a third heat map relating to the true value of the predetermined direction.
- Inputting the first heat map, the feature information, and the true value information to the estimator may consist of inputting the first heat map, the second heat map, and the third heat map to the estimator. According to this configuration, adopting a common heat map format for the input data allows the configuration of the estimator to be kept relatively simple and makes it easier for the estimator to integrate each piece of information (the feature information, the true value information, and the target image), so an improvement in the estimation accuracy of the estimator can be expected.
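One way to picture the conversion of the true value into a heat map is the sketch below. It is an assumption-laden illustration rather than the conversion used in the application: the Gaussian shape, the `sigma` parameter, the grid size, and both function names are hypothetical. The first and second heat maps are filled with placeholder values here, whereas in the described configuration they would be produced by the trained converters.

```python
import math


def direction_to_heatmap(gaze_x, gaze_y, width, height, sigma=1.5):
    """Encode a gaze position as a 2-D Gaussian heat map (list of rows)."""
    return [
        [
            math.exp(-((col - gaze_x) ** 2 + (row - gaze_y) ** 2) / (2 * sigma ** 2))
            for col in range(width)
        ]
        for row in range(height)
    ]


def heatmap_peak(heatmap):
    """Return the (x, y) cell with the highest value."""
    best = max(
        (value, col, row)
        for row, values in enumerate(heatmap)
        for col, value in enumerate(values)
    )
    return best[1], best[2]


# Hypothetical true value: the target person was asked to look at cell (6, 2).
third_heatmap = direction_to_heatmap(6, 2, width=9, height=5)
# Placeholders for the converter outputs, which would come from trained models.
first_heatmap = direction_to_heatmap(3, 1, width=9, height=5)
second_heatmap = direction_to_heatmap(5, 2, width=9, height=5)
# All three share one shape, so they can be stacked as input channels.
estimator_input = [first_heatmap, second_heatmap, third_heatmap]
```

Because the three inputs have identical dimensions, an estimator can treat them as channels of a single array, which is the simplification the text attributes to a common heat map format.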
- the image acquisition unit may repeatedly acquire the target image and the estimation unit may repeatedly estimate the line-of-sight direction of the target person. According to this configuration, it is possible to continuously estimate the line-of-sight direction of the subject.
- The information acquisition unit may acquire the calibration information by outputting an instruction to the target person to look in a predetermined direction and then observing the line of sight of the target person with a sensor. According to this configuration, it is possible to appropriately and easily acquire calibration information representing the individuality of the target person's line of sight.
- One aspect of the present invention may be an apparatus that generates a trained estimation model that can be used in the line-of-sight estimation apparatus according to each of the above embodiments.
- The model generation device according to one aspect of the present invention includes a first acquisition unit that acquires learning calibration information including learning feature information regarding the line of sight of the eyes of a subject looking in a predetermined direction and learning true value information indicating the true value of the predetermined direction in which the subject's eyes are looking.
- It further includes a second acquisition unit that acquires a plurality of learning data sets, each composed of a combination of a learning target image in which the subject's eyes are captured and correct answer information indicating the true value of the subject's line-of-sight direction in the learning target image, and a machine learning unit that performs machine learning of an estimation model using the acquired plurality of learning data sets. Performing the machine learning consists of training the estimation model so that, for each learning data set, it outputs an output value that matches the corresponding correct answer information for the input of the learning target image and the learning calibration information.
- one aspect of the present invention may be an information processing method or a program that realizes each of the above configurations.
- a storage medium that stores such a program and can be read by a computer or the like may be used.
- the storage medium that can be read by a computer or the like is a medium that stores information such as a program by electrical, magnetic, optical, mechanical, or chemical action.
- the line-of-sight estimation system according to one aspect of the present invention may be composed of a line-of-sight estimation device and a model generation device according to any one of the above forms.
- In the line-of-sight estimation method according to one aspect of the present invention, a computer executes the following steps.
- A step of acquiring calibration information including feature information regarding the line of sight of the eyes of the target person looking in a predetermined direction and true value information indicating the true value of the predetermined direction in which the target person's eyes are looking.
- A step of acquiring a target image in which the target person's eyes are captured.
- A step of estimating the line-of-sight direction of the target person in the target image using a trained estimation model generated by machine learning. The model has been trained to output, for the input of learning calibration information and a learning target image obtained from a subject, an output value that matches the correct answer information indicating the true value of the subject's line-of-sight direction in the learning target image. The learning calibration information is the same type of data as the calibration information, and the learning target image is the same type of data as the target image.
- In this step, estimating the line-of-sight direction consists of inputting the acquired target image and the calibration information into the trained estimation model and executing the arithmetic processing of the model, thereby acquiring from the model an output value corresponding to the result of estimating the line-of-sight direction of the target person in the target image.
- A step of outputting information regarding the result of estimating the line-of-sight direction of the target person.
- The model generation method according to one aspect of the present invention includes a step of acquiring learning calibration information including learning feature information regarding the line of sight of the eyes of a subject looking in a predetermined direction and learning true value information indicating the true value of the predetermined direction in which the subject's eyes are looking, and a step of acquiring learning data sets each composed of a learning target image in which the subject's eyes are captured and correct answer information indicating the true value of the subject's line-of-sight direction in the learning target image.
- According to the present invention, the line-of-sight direction of the target person can be estimated with high accuracy.
- FIG. 1 schematically illustrates an example of a situation in which the present invention is applied.
- FIG. 2 schematically illustrates an example of the hardware configuration of the model generator according to the embodiment.
- FIG. 3 schematically illustrates an example of the hardware configuration of the line-of-sight estimation device according to the embodiment.
- FIG. 4A schematically illustrates an example of the software configuration of the model generator according to the embodiment.
- FIG. 4B schematically illustrates an example of the software configuration of the model generator according to the embodiment.
- FIG. 5A schematically illustrates an example of the software configuration of the line-of-sight estimation device according to the embodiment.
- FIG. 5B schematically illustrates an example of the software configuration of the line-of-sight estimation device according to the embodiment.
- FIG. 6 shows an example of the processing procedure of the model generator according to the embodiment.
- FIG. 7 shows an example of the processing procedure of the line-of-sight estimation device according to the embodiment.
- FIG. 8 schematically illustrates an example of a scene in which calibration information according to an embodiment is acquired.
- FIG. 9 schematically illustrates an example of the software configuration of the model generator according to the modified example.
- FIG. 10 schematically illustrates an example of the software configuration of the line-of-sight estimation device according to the modified example.
- FIG. 11 schematically illustrates an example of the software configuration of the model generator according to the modified example.
- FIG. 12 schematically illustrates an example of the software configuration of the line-of-sight estimation device according to the modified example.
- FIG. 13A schematically illustrates an example of the software configuration of the model generator according to the modified example.
- FIG. 13B schematically illustrates an example of the software configuration of the model generator according to the modified example.
- FIG. 14 schematically illustrates an example of the software configuration of the line-of-sight estimation device according to the modified example.
- Hereinafter, an embodiment according to one aspect of the present invention (hereinafter also referred to as "the present embodiment") will be described with reference to the drawings.
- The embodiments described below are merely examples of the present invention in all respects. Needless to say, various improvements and modifications can be made without departing from the scope of the present invention. That is, in carrying out the present invention, a specific configuration suited to the embodiment may be adopted as appropriate.
- Although the data appearing in the present embodiment is described in natural language, more specifically it is specified in a pseudo language, commands, parameters, machine language, or the like that can be recognized by a computer.
- FIG. 1 schematically illustrates an example of a situation in which the present invention is applied.
- the line-of-sight estimation system 100 includes a model generation device 1 and a line-of-sight estimation device 2.
- the model generation device 1 is a computer configured to generate a trained estimation model 3 that can be used to estimate the line-of-sight direction of the target person.
- the model generation device 1 acquires the learning calibration information 50 including the learning feature information and the learning true value information.
- the learning feature information relates to the line of sight of the eyes of the subject looking in a predetermined direction.
- The learning true value information indicates the true value of the predetermined direction in which the subject's eyes are looking.
- The predetermined direction is a line-of-sight direction whose true value is known.
- The specific value of the predetermined direction is not particularly limited and may be selected as appropriate for the embodiment. As the predetermined direction, it is preferable to select a direction that is likely to appear in the scene in which the line-of-sight direction of the target person is estimated.
- the model generation device 1 acquires a plurality of learning data sets 51 each composed of a combination of a learning target image 53 in which the subject's eyes are captured and correct answer information 55.
- The correct answer information 55 indicates the true value of the line-of-sight direction of the subject in the learning target image 53.
- The plurality of learning data sets 51 may include learning data sets obtained from a subject looking in a predetermined direction, as in the case of the learning calibration information 50.
- "For learning" means that the item is used for machine learning. The "for learning" qualifier may be omitted.
- the model generation device 1 performs machine learning of the estimation model 3 using the acquired plurality of learning data sets 51.
- Performing the machine learning consists of training the estimation model 3 so that, for each learning data set 51, it outputs an output value matching the corresponding correct answer information 55 for the input of the learning target image 53 and the learning calibration information 50.
- This makes it possible to generate a trained estimation model 3 that has acquired the ability to estimate, from the calibration information and the target image, the line-of-sight direction of the person captured in the target image.
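The training objective described above can be illustrated with a deliberately small stand-in. The sketch below is not the estimation model 3: it replaces the neural network with a linear model, replaces images with one-dimensional synthetic features, and uses plain full-batch gradient descent on a squared error. Every name, the data generator, and the hyperparameters are assumptions chosen so that the example runs with the standard library alone.

```python
import random

rng = random.Random(0)


def make_example():
    """Synthetic 1-D data: each subject has a hidden per-person gaze offset."""
    offset = rng.uniform(-0.5, 0.5)      # individual difference (unobserved)
    known_dir = rng.uniform(-1.0, 1.0)   # learning true value information
    ref_feature = known_dir - offset     # learning feature information
    true_gaze = rng.uniform(-1.0, 1.0)   # correct answer information
    target_feature = true_gaze - offset  # stands in for the learning target image
    return [target_feature, ref_feature, known_dir, 1.0], true_gaze


dataset = [make_example() for _ in range(200)]
weights = [0.0, 0.0, 0.0, 0.0]


def predict(inputs):
    return sum(w * x for w, x in zip(weights, inputs))


def mean_squared_error():
    return sum((predict(x) - y) ** 2 for x, y in dataset) / len(dataset)


initial_loss = mean_squared_error()
learning_rate = 0.2
for _ in range(500):
    # Full-batch gradient of the mean squared error with respect to each weight.
    grads = [0.0] * len(weights)
    for inputs, answer in dataset:
        error = predict(inputs) - answer
        for i, x in enumerate(inputs):
            grads[i] += 2 * error * x / len(dataset)
    weights = [w - learning_rate * g for w, g in zip(weights, grads)]
final_loss = mean_squared_error()
print(initial_loss, final_loss)
```

In this toy setting the calibration inputs are what let the model recover the hidden offset: the target feature alone does not determine the true gaze, while the reference feature together with the known direction does.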
- "Learned" may be read as "trained".
- The line-of-sight estimation device 2 is a computer configured to estimate the line-of-sight direction of the target person R by using the generated trained estimation model 3. Specifically, the line-of-sight estimation device 2 according to the present embodiment acquires calibration information 60 including feature information and true value information for the target person R.
- The calibration information 60 is the same kind of data as the learning calibration information 50 obtained from the subject.
- The target person R may or may not be the same person as the subject.
- The feature information relates to the line of sight of the eyes of the target person R looking in a predetermined direction.
- the form of the feature information is not particularly limited as long as it contains components related to eye features that form a line of sight in a predetermined direction, and may be appropriately determined according to an embodiment.
- The feature information may be composed of a reference image in which the eyes of the target person R looking in a predetermined direction are captured.
- the feature information may be composed of the feature amount of the line of sight extracted from the reference image.
- the feature information is the same kind of data as the above-mentioned learning feature information.
- The true value information indicates the true value of the predetermined direction in which the eyes of the target person R are looking.
- The data format of the true value, that is, the format in which the line-of-sight direction is expressed, is not particularly limited as long as it conveys information about the line-of-sight direction, and may be selected as appropriate for the embodiment.
- the line-of-sight direction may be expressed by an angle such as an elevation / depression angle or an azimuth angle.
- the line-of-sight direction may be expressed by a position (hereinafter, also referred to as a “gaze position”) that is gazed within the visual field range.
- the angle or gaze position may be expressed directly numerically, or may be expressed by degree or probability using a heat map.
- the true value information is the same kind of data as the true value information for learning.
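- As a minimal sketch of the expression formats mentioned above, the following Python code converts an elevation/azimuth pair into a unit direction vector and back, and reads a gaze position out of a heat map as a weighted average. The axis convention (x to the right, y up, z forward), the use of degrees, and all function names are assumptions made for illustration; the embodiment does not prescribe them.

```python
import math


def angles_to_vector(elevation_deg, azimuth_deg):
    """Convert elevation/azimuth angles (degrees) to a unit 3D direction.

    Assumed convention (not from the source): x right, y up, z forward.
    """
    el = math.radians(elevation_deg)
    az = math.radians(azimuth_deg)
    return (math.cos(el) * math.sin(az), math.sin(el), math.cos(el) * math.cos(az))


def vector_to_angles(vec):
    """Inverse of angles_to_vector for any non-zero vector."""
    x, y, z = vec
    norm = math.sqrt(x * x + y * y + z * z)
    return (math.degrees(math.asin(y / norm)), math.degrees(math.atan2(x, z)))


def heatmap_to_gaze_position(heatmap):
    """Weighted-average (row, col) position of a non-negative 2D heat map."""
    total = sum(sum(row) for row in heatmap)
    row_mean = sum(r * v for r, row in enumerate(heatmap) for v in row) / total
    col_mean = sum(c * v for row in heatmap for c, v in enumerate(row)) / total
    return (row_mean, col_mean)


# Looking straight ahead maps to the +z axis under the assumed convention.
forward = angles_to_vector(0.0, 0.0)

# Hypothetical heat map: most of the mass sits in the bottom-right cell.
heat = [[0.0, 0.0, 0.0], [0.0, 1.0, 3.0]]
gaze_position = heatmap_to_gaze_position(heat)
```

- Either representation can serve as the true value as long as the same format is used for the true value information and for the output of the estimation model.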
- The calibration information 60 may include the feature information and the true value information as they are, as separate data (for example, in a separable format), or it may include information derived by combining the feature information and the true value information (for example, the calibration feature amount described later). An example of the configuration of the calibration information 60 will be described later.
- the line-of-sight estimation device 2 acquires the target image 63 in which the eyes of the target person R are captured.
- the line-of-sight estimation device 2 is connected to the camera S, and the target image 63 can be acquired from the camera S.
- the target image 63 may include an image of the eyes of the subject R.
- the target image 63 may be an image as it is obtained by the camera S, or may be a partial image extracted from the obtained image.
- the partial image may be obtained, for example, by extracting a range in which at least one eye is captured from the image obtained by the camera S.
- a known image process may be used to extract the partial image.
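- The extraction of a partial image can be sketched as a simple crop. In the following Python code the image is a list of pixel rows and the bounding box of the eye region is assumed to be supplied by some eye detector that is outside the scope of this sketch; the function name and the sample values are hypothetical.

```python
def crop_region(image, top, left, height, width):
    """Return the height x width sub-image whose top-left corner is (top, left).

    `image` is a list of rows, each a list of pixel values.
    """
    if top < 0 or left < 0 or height <= 0 or width <= 0:
        raise ValueError("bounding box must have a non-negative origin and positive size")
    if top + height > len(image) or left + width > len(image[0]):
        raise ValueError("bounding box exceeds image bounds")
    return [row[left:left + width] for row in image[top:top + height]]


# Hypothetical 4 x 6 frame whose pixel value encodes its own position (10 * row + col).
frame = [[10 * r + c for c in range(6)] for r in range(4)]

# Assumed eye bounding box: rows 1-2, columns 2-4.
eye_patch = crop_region(frame, top=1, left=2, height=2, width=3)
```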
- the line-of-sight estimation device 2 estimates the line-of-sight direction of the eyes of the target person R reflected in the target image 63 by using the learned estimation model 3 generated by the machine learning.
- Estimating the line-of-sight direction consists of inputting the acquired target image 63 and the calibration information 60 into the trained estimation model 3 and executing the arithmetic processing of the trained estimation model 3, thereby acquiring from the trained estimation model 3 an output value corresponding to the result of estimating the line-of-sight direction of the eyes of the target person R appearing in the target image 63.
- the line-of-sight estimation device 2 outputs information regarding the result of estimating the line-of-sight direction of the subject R.
- the calibration information 60 including the feature information and the true value information is used to estimate the line-of-sight direction of the target person R.
- From the feature information and the true value information, it is possible to grasp the characteristics of the eyes that form a line of sight in a direction known from the true value (that is, the individuality of the line of sight of the target person R). Therefore, according to the present embodiment, the difference in line-of-sight direction caused by individual differences between the subject and the target person R can be calibrated by additionally using the calibration information 60 for estimating the line-of-sight direction.
- the line-of-sight direction of the subject R can be estimated in consideration of individual differences. Therefore, according to the present embodiment, it is possible to improve the accuracy of estimating the line-of-sight direction of the subject R in the line-of-sight estimation device 2. Further, according to the model generation device 1 according to the present embodiment, it is possible to generate a learned estimation model 3 capable of estimating the line-of-sight direction of the subject R with such high accuracy.
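- The intuition behind using calibration information can be illustrated with a deliberately simplified stand-in. The following Python code does not reproduce the learned calibration of the embodiment; it is a plain constant-offset correction, in which the average error of an uncalibrated estimate at known directions is added back to later estimates. All numbers and function names are hypothetical.

```python
def mean_offset(raw_estimates, true_values):
    """Average per-axis error (true minus raw) observed at known directions."""
    n = len(raw_estimates)
    d_el = sum(t[0] - r[0] for r, t in zip(raw_estimates, true_values)) / n
    d_az = sum(t[1] - r[1] for r, t in zip(raw_estimates, true_values)) / n
    return (d_el, d_az)


def apply_offset(raw_estimate, offset):
    """Shift a raw (elevation, azimuth) estimate by the person-specific offset."""
    return (raw_estimate[0] + offset[0], raw_estimate[1] + offset[1])


# Hypothetical raw estimates while the person looks at two known directions.
raw_at_calibration = [(1.0, -9.0), (1.5, 11.0)]
true_at_calibration = [(0.0, -10.0), (0.0, 10.0)]

offset = mean_offset(raw_at_calibration, true_at_calibration)
corrected = apply_offset((6.0, 1.0), offset)
```

- In the embodiment, the relationship between the eye features and the known directions is learned by the estimation model rather than modeled as a fixed offset, so individual differences that are not a simple shift can also be taken into account.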
- This embodiment may be applied to any situation in which the line-of-sight direction of the subject R is estimated.
- Examples of scenes in which the line-of-sight direction is estimated include estimating the line-of-sight direction of a driver who drives a vehicle, estimating the line-of-sight direction of a user communicating with a robot device, and estimating the line-of-sight direction of a user in a user interface and using the obtained estimation result as input.
- the driver and the user are examples of the subject R.
- the estimation result of the line-of-sight direction may be appropriately used according to each scene.
- the model generation device 1 and the line-of-sight estimation device 2 are connected to each other via a network.
- the type of network may be appropriately selected from, for example, the Internet, a wireless communication network, a mobile communication network, a telephone network, a dedicated network, and the like.
- the method of exchanging data between the model generation device 1 and the line-of-sight estimation device 2 does not have to be limited to such an example, and may be appropriately selected depending on the embodiment.
- data may be exchanged between the model generation device 1 and the line-of-sight estimation device 2 using a storage medium.
- the model generation device 1 and the line-of-sight estimation device 2 are each configured by a separate computer.
- the configuration of the line-of-sight estimation system 100 according to the present embodiment does not have to be limited to such an example, and may be appropriately determined according to the embodiment.
- the model generation device 1 and the line-of-sight estimation device 2 may be an integrated computer.
- at least one of the model generation device 1 and the line-of-sight estimation device 2 may be configured by a plurality of computers.
- FIG. 2 schematically illustrates an example of the hardware configuration of the model generation device 1 according to the present embodiment.
- The model generation device 1 according to the present embodiment is a computer in which the control unit 11, the storage unit 12, the communication interface 13, the external interface 14, the input device 15, the output device 16, and the drive 17 are electrically connected.
- In the figure, the communication interface and the external interface are written as "communication I/F" and "external I/F", respectively.
- the control unit 11 includes a CPU (Central Processing Unit), a RAM (Random Access Memory), a ROM (Read Only Memory), and the like, which are hardware processors, and is configured to execute information processing based on a program and various data.
- the storage unit 12 is an example of a memory, and is composed of, for example, a hard disk drive, a solid state drive, or the like. In the present embodiment, the storage unit 12 stores various information such as the model generation program 81, the plurality of data sets 120, and the learning result data 125.
- the model generation program 81 is a program for causing the model generation device 1 to execute information processing (FIG. 6) described later, which generates a trained estimation model 3 by performing machine learning.
- the model generation program 81 includes a series of instructions for the information processing.
- Each data set 120 is composed of a combination of the learning image 121 and the correct answer information 123.
- The learning result data 125 indicates information about the trained estimation model 3 generated by machine learning.
- the learning result data 125 is generated as a result of executing the model generation program 81. Details will be described later.
- the communication interface 13 is, for example, a wired LAN (Local Area Network) module, a wireless LAN module, or the like, and is an interface for performing wired or wireless communication via a network.
- the model generation device 1 may execute data communication via a network with another information processing device by using the communication interface 13.
- the external interface 14 is, for example, a USB (Universal Serial Bus) port, a dedicated port, or the like, and is an interface for connecting to an external device.
- the type and number of external interfaces 14 may be arbitrarily selected.
- The model generation device 1 may be connected to a camera for obtaining the learning image 121 via at least one of the communication interface 13 and the external interface 14.
- The input device 15 is a device for input, such as a mouse or a keyboard.
- The output device 16 is a device for output, such as a display or a speaker. An operator such as a user can operate the model generation device 1 by using the input device 15 and the output device 16.
- the drive 17 is, for example, a CD drive, a DVD drive, or the like, and is a drive device for reading various information such as a program stored in the storage medium 91.
- The storage medium 91 is a medium that accumulates information such as a stored program by electrical, magnetic, optical, mechanical, or chemical action so that a computer or other device or machine can read that information. At least one of the model generation program 81 and the plurality of data sets 120 may be stored in the storage medium 91.
- the model generation device 1 may acquire at least one of the model generation program 81 and the plurality of data sets 120 from the storage medium 91. Note that FIG.
- The type of the storage medium 91 is not limited to the disk type, and may be other than the disk type. Examples of storage media other than the disk type include semiconductor memories such as flash memories.
- the type of the drive 17 may be arbitrarily selected according to the type of the storage medium 91.
- the control unit 11 may include a plurality of hardware processors.
- the hardware processor may be composed of a microprocessor, an FPGA (field-programmable gate array), a DSP (digital signal processor), or the like.
- the storage unit 12 may be composed of a RAM and a ROM included in the control unit 11. At least one of the communication interface 13, the external interface 14, the input device 15, the output device 16, and the drive 17 may be omitted.
- The model generation device 1 may be composed of a plurality of computers. In this case, the hardware configurations of the computers may or may not match. Further, the model generation device 1 may be a general-purpose server device, a PC (Personal Computer), or the like, in addition to an information processing device designed exclusively for the provided service.
- FIG. 3 schematically illustrates an example of the hardware configuration of the line-of-sight estimation device 2 according to the present embodiment.
- The line-of-sight estimation device 2 according to the present embodiment is a computer in which the control unit 21, the storage unit 22, the communication interface 23, the external interface 24, the input device 25, the output device 26, and the drive 27 are electrically connected.
- The control unit 21 through the drive 27 and the storage medium 92 of the line-of-sight estimation device 2 may be configured in the same manner as the control unit 11 through the drive 17 and the storage medium 91 of the model generation device 1, respectively.
- the control unit 21 includes a CPU, RAM, ROM, etc., which are hardware processors, and is configured to execute various information processing based on programs and data.
- the storage unit 22 is composed of, for example, a hard disk drive, a solid state drive, or the like. In the present embodiment, the storage unit 22 stores various information such as the line-of-sight estimation program 82, the calibration information 60, and the learning result data 125.
- The line-of-sight estimation program 82 is a program for causing the line-of-sight estimation device 2 to execute the information processing described later (FIG. 7), which estimates the line-of-sight direction of the target person R appearing in the target image 63 by using the trained estimation model 3.
- the line-of-sight estimation program 82 includes a series of instructions for the information processing. At least one of the line-of-sight estimation program 82, the calibration information 60, and the learning result data 125 may be stored in the storage medium 92. Further, the line-of-sight estimation device 2 may acquire at least one of the line-of-sight estimation program 82, the calibration information 60, and the learning result data 125 from the storage medium 92.
- the line-of-sight estimation device 2 is connected to the camera S (imaging device) via the external interface 24.
- the line-of-sight estimation device 2 can acquire the target image 63 from the camera S.
- the connection method with the camera S does not have to be limited to such an example, and may be appropriately selected depending on the embodiment.
- For example, when the camera S includes a communication interface, the line-of-sight estimation device 2 may be connected to the camera S via the communication interface 23.
- the type of camera S may be appropriately selected according to the embodiment.
- the camera S may be, for example, a general RGB camera, a depth camera, an infrared camera, or the like.
- the camera S may be appropriately arranged so as to photograph the eyes of the subject R.
- the components can be omitted, replaced, or added as appropriate according to the embodiment.
- the control unit 21 may include a plurality of hardware processors.
- the hardware processor may be composed of a microprocessor, FPGA, DSP and the like.
- the storage unit 22 may be composed of a RAM and a ROM included in the control unit 21. At least one of the communication interface 23, the external interface 24, the input device 25, the output device 26, and the drive 27 may be omitted.
- the line-of-sight estimation device 2 may be composed of a plurality of computers. In this case, the hardware configurations of the computers may or may not match. Further, the line-of-sight estimation device 2 may be a general-purpose server device, a general-purpose PC, a PLC (programmable logic controller), or the like, in addition to an information processing device designed exclusively for the provided service.
- FIGS. 4A and 4B schematically illustrate an example of the software configuration of the model generation device 1 according to the present embodiment.
- the control unit 11 of the model generation device 1 expands the model generation program 81 stored in the storage unit 12 into the RAM. Then, the control unit 11 controls each component by interpreting and executing the instruction included in the model generation program 81 expanded in the RAM by the CPU.
- As a result, the model generation device 1 according to the present embodiment operates as a computer equipped with the collection unit 111, the first acquisition unit 112, the second acquisition unit 113, the machine learning unit 114, and the storage processing unit 115 as software modules. That is, in the present embodiment, each software module of the model generation device 1 is realized by the control unit 11 (CPU).
- the collection unit 111 acquires a plurality of data sets 120.
- Each data set 120 is composed of a combination of a learning image 121 in which the subject's eyes are captured and correct answer information 123.
- The correct answer information 123 indicates the true value of the line-of-sight direction of the subject appearing in the corresponding learning image 121.
- the first acquisition unit 112 acquires the learning calibration information 50 including the learning feature information 502 and the learning true value information 503.
- the learning feature information 502 relates to the line of sight of the eyes of the subject looking in a predetermined direction.
- The learning true value information 503 indicates the true value of the predetermined direction in which the eyes of the subject are looking, corresponding to the learning feature information 502.
- the data set 120 obtained for the subject looking in a predetermined direction can be used to acquire the learning calibration information 50.
- the learning calibration information 50 may include learning feature information 502 and learning true value information 503 corresponding to each of a plurality of different predetermined directions. That is, a plurality of predetermined directions for grasping the individuality of the line of sight of a person may be set, and the calibration information may include feature information and true value information for each set predetermined direction.
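- One possible way to hold feature information and true value information for several predetermined directions is sketched below in Python. The class names, the use of a flat list of numbers as the feature information, and the (elevation, azimuth) format of the true value are assumptions for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class CalibrationEntry:
    """Feature information and true value for one predetermined direction."""
    feature_info: List[float]        # e.g. a feature amount of a reference image
    true_value: Tuple[float, float]  # assumed format: (elevation, azimuth) in degrees


@dataclass
class CalibrationInfo:
    """Calibration information covering any number of predetermined directions."""
    entries: List[CalibrationEntry] = field(default_factory=list)

    def add(self, feature_info, true_value):
        self.entries.append(CalibrationEntry(list(feature_info), tuple(true_value)))

    def directions(self):
        return [entry.true_value for entry in self.entries]


# Hypothetical calibration with two predetermined directions (left and right).
calibration = CalibrationInfo()
calibration.add([0.2, 0.7], (0.0, -10.0))
calibration.add([0.9, 0.1], (0.0, 10.0))
```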
- The second acquisition unit 113 acquires a plurality of learning data sets 51, each composed of a combination of a learning target image 53 in which the subject's eyes are captured and correct answer information 55 indicating the true value of the line-of-sight direction of the subject captured in the learning target image 53.
- In the present embodiment, each of the above data sets 120 can be used as a learning data set 51. That is, the learning image 121 can be used as the learning target image 53, and the correct answer information 123 can be used as the correct answer information 55.
- The machine learning unit 114 uses the acquired plurality of learning data sets 51 to perform machine learning of the estimation model 3. Performing machine learning consists of training the estimation model 3, for each learning data set 51, so that it outputs an output value matching the corresponding correct answer information 55 in response to the input of the learning target image 53 and the learning calibration information 50.
- The configuration of the estimation model 3 is not particularly limited as long as it can execute the calculation for estimating the line-of-sight direction of a person from the calibration information and the target image, and may be appropriately determined according to the embodiment. Further, as long as it contains components related to the feature information and the true value information (that is, components related to the features of the eyes forming a line of sight in a known direction), the data format of the calibration information is not particularly limited and may be appropriately determined according to the embodiment. The machine learning procedure may be appropriately determined according to the configuration of the estimation model 3 and the calibration information.
- the estimation model 3 includes an extractor 31 and an estimator 32.
- the extractor 31 is an example of the first extractor.
- Including the feature information and the true value information consists of including a calibration feature amount derived by combining the feature information and the true value information. That is, the calibration information is composed of the calibration feature amount. Combining may simply mean integrating the information, or may include integrating and compressing the information.
- the extractor 35 and the coupler 36 are used to obtain the calibration features.
- the extractor 35 is an example of the second extractor.
- the extractor 31 is configured to receive an input of an image (target image) in which a person's eyes appear and output an output value corresponding to a feature amount related to the input image.
- the extractor 31 is configured to extract a feature amount from an image in which a person's eyes are captured.
- The estimator 32 is configured to receive the feature amount calculated by the extractor 31 and the calibration feature amount as input, and to output an output value corresponding to the result of estimating the line-of-sight direction of the person appearing in the corresponding image (that is, the image input to the extractor 31 to obtain the input feature amount).
- the estimator 32 is configured to estimate the line-of-sight direction of a person from the feature amount and the calibration feature amount of the image.
- the output of the extractor 31 is connected to the input of the estimator 32.
- the extractor 35 is configured to receive an input of an image in which a person's eyes appear and output an output value corresponding to a feature amount related to the input image.
- The extractor 35 may be the same extractor as the extractor 31 (that is, the extractor 35 is identical to the extractor 31), or may be an extractor different from the extractor 31 (that is, the extractor 35 is not identical to the extractor 31).
- the coupler 36 is configured to accept input of feature information and true value information and output an output value corresponding to a calibration feature amount related to calibration derived by combining the input feature information and true value information.
- the feature information is composed of a feature amount related to a reference image in which the eyes of a person (target person) looking in a predetermined direction are captured.
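- The data flow among the extractors (31, 35), the coupler 36, and the estimator 32 can be sketched as follows in Python. The extractor here is a toy stand-in (row averages) rather than a convolutional network, the coupler and estimator are single fully connected layers, and the weights are hand-picked rather than trained; all of these, along with the function and key names, are assumptions made only to show which output feeds which input.

```python
def linear(weights, bias, x):
    """Fully connected layer y = W x + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def extract_features(image):
    """Toy stand-in for extractors 31 and 35: the average of each pixel row."""
    return [sum(row) / len(row) for row in image]


def coupler(feature, true_value, params):
    """Combine feature information and true value into a calibration feature amount."""
    return linear(params["coupler_w"], params["coupler_b"], list(feature) + list(true_value))


def estimator(feature, calibration_feature, params):
    """Estimate the direction from the image feature and the calibration feature."""
    x = list(feature) + list(calibration_feature)
    return linear(params["estimator_w"], params["estimator_b"], x)


def estimate(target_image, reference_image, true_value, params):
    calibration_feature = coupler(extract_features(reference_image), true_value, params)
    return estimator(extract_features(target_image), calibration_feature, params)


# Hand-picked (untrained) weights: the coupler outputs (true value - reference
# feature) and the estimator adds that to the target feature.
params = {
    "coupler_w": [[-1, 0, 1, 0], [0, -1, 0, 1]],
    "coupler_b": [0, 0],
    "estimator_w": [[1, 0, 1, 0], [0, 1, 0, 1]],
    "estimator_b": [0, 0],
}

reference = [[1, 3], [2, 6]]
same_pose = estimate(reference, reference, (0, 10), params)
other_pose = estimate([[2, 4], [2, 6]], reference, (0, 10), params)
```

- With these weights, feeding the reference image itself as the target image returns the true value of the reference direction, which is a convenient sanity check for the wiring.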
- The machine learning unit 114 first prepares a learning model 4 including an extractor 41 and an estimator 43 in order to generate a trained extractor that can be used as each extractor (31, 35).
- the extractor 41 corresponds to each extractor (31, 35).
- the output of the extractor 41 is connected to the input of the estimator 43.
- The estimator 43 is configured to receive the feature amount calculated by the extractor 41 as input and to output an output value corresponding to the result of estimating the line-of-sight direction of the person appearing in the corresponding image (that is, the image input to the extractor 41 to obtain the input feature amount).
- the machine learning unit 114 uses the acquired plurality of data sets 120 to perform machine learning of the learning model 4.
- the machine learning unit 114 inputs the learning image 121 included in each data set 120 to the extractor 41, and executes arithmetic processing of the extractor 41 and the estimator 43. By this arithmetic processing, the machine learning unit 114 acquires an output value corresponding to the result of estimating the line-of-sight direction of the subject reflected in the learning image 121 from the estimator 43.
- the machine learning unit 114 trains the learning model 4 for each data set 120 so that the output value obtained from the estimator 43 by the arithmetic processing matches the correct answer information 123.
- As a result of this training, the output (that is, the feature amount) of the trained extractor 41 comes to contain components related to the subject's eyes in the learning image 121, so that the estimator 43 can estimate the line-of-sight direction of the subject.
- the trained extractor 41 generated by machine learning may be commonly used as each extractor (31, 35). In this case, the amount of information of each extractor (31, 35) can be reduced, and the cost of machine learning can be suppressed.
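- The training of the learning model 4 can be illustrated with a toy example. In the following Python code each learning image is reduced to a single number, the extractor 41 and the estimator 43 are each a single scalar weight, and training is plain stochastic gradient descent on the squared error against the correct answer. The scalar model, the learning rate, the number of epochs, and the data are all assumptions; the embodiment uses neural networks and does not fix these details.

```python
def train_learning_model(samples, epochs=200, lr=0.1):
    """Train output = w_estimator * (w_extractor * x) to match the correct answers."""
    w_extractor, w_estimator = 0.5, 0.5  # arbitrary initial values
    losses = []
    for _ in range(epochs):
        total = 0.0
        for x, answer in samples:
            feature = w_extractor * x          # extractor 41 (toy)
            output = w_estimator * feature     # estimator 43 (toy)
            error = output - answer
            total += error ** 2
            # Gradients of the squared error with respect to each weight.
            grad_estimator = 2.0 * error * feature
            grad_extractor = 2.0 * error * w_estimator * x
            w_estimator -= lr * grad_estimator
            w_extractor -= lr * grad_extractor
        losses.append(total / len(samples))
    return w_extractor, w_estimator, losses


# Hypothetical data sets: the correct answer is twice the (scalar) image value.
samples = [(x, 2.0 * x) for x in (-1.0, -0.5, 0.5, 1.0)]
w_e, w_s, losses = train_learning_model(samples)
```

- After training, only the extractor weight would be reused, in the same way that the trained extractor 41 is reused as each extractor (31, 35).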
- Alternatively, the machine learning unit 114 may prepare a separate learning model 4, at least for the extractor 41 part, for each extractor (31, 35) and perform machine learning of each. Then, the trained extractor 41 generated by each machine learning may be used as the respective extractor (31, 35). For each extractor (31, 35), the trained extractor 41 may be used as it is, or a replica of the trained extractor 41 may be used.
- When a plurality of different predetermined directions are set, the extractor 35 may be prepared separately for each set direction, or may be prepared in common for the plurality of set directions. When the extractor 35 is prepared in common for a plurality of different directions, the amount of information of the extractor 35 can be reduced and the cost of machine learning can be suppressed.
- the machine learning unit 114 prepares a learning model 30 including an extractor 35, a coupler 36, and an estimation model 3.
- the machine learning unit 114 performs machine learning of the learning model 30 so that the estimator 32 of the estimation model 3 finally acquires the ability to estimate the line-of-sight direction of the person.
- the output of the coupler 36 is connected to the input of the estimator 32.
- both the estimator 32 and the coupler 36 are trained in the machine learning of the learning model 30.
- The first acquisition unit 112 acquires the learning calibration information 50 by using the extractor 35 and the coupler 36. Specifically, the first acquisition unit 112 acquires the learning reference image 501 in which the eyes of the subject looking in a predetermined direction are captured and the learning true value information 503 indicating the true value of the predetermined direction. The first acquisition unit 112 may acquire the learning image 121 included in the data set 120 obtained for the subject looking in the predetermined direction as the learning reference image 501, and may acquire the corresponding correct answer information 123 as the learning true value information 503.
- the first acquisition unit 112 inputs the acquired learning reference image 501 to the extractor 35, and executes the arithmetic processing of the extractor 35. As a result, the first acquisition unit 112 acquires the output value corresponding to the feature amount 5021 regarding the learning reference image 501 from the extractor 35.
- the learning feature information 502 is composed of the feature amount 5021.
- the first acquisition unit 112 inputs the acquired feature amount 5021 and the learning true value information 503 into the coupler 36, and executes the arithmetic processing of the coupler 36.
- the first acquisition unit 112 acquires an output value corresponding to the calibration feature amount 504 derived by combining the learning feature information 502 and the learning true value information 503 from the coupler 36.
- the feature amount 504 is an example of the calibration feature amount for learning.
- the learning calibration information 50 is composed of the feature amount 504.
- the first acquisition unit 112 can acquire the learning calibration information 50 by using the extractor 35 and the coupler 36 by these arithmetic processes.
- the first acquisition unit 112 may acquire the learning reference image 501 and the learning true value information 503 for each of the plurality of different predetermined directions.
- the first acquisition unit 112 may input each learning reference image 501 to the extractor 35 and execute the arithmetic processing of the extractor 35.
- the first acquisition unit 112 may acquire each feature amount 5021 from the extractor 35.
- The first acquisition unit 112 may input the acquired feature amount 5021 and the learning true value information 503 for each predetermined direction into the coupler 36, and execute the arithmetic processing of the coupler 36.
- The first acquisition unit 112 may acquire the feature amount 504 derived by combining the learning feature information 502 and the learning true value information 503 in each of a plurality of different predetermined directions.
- the feature amount 504 may include information that aggregates the learning feature information 502 and the learning true value information 503 in each of a plurality of different predetermined directions.
- the method of acquiring the feature amount 504 does not have to be limited to such an example.
- the feature amount 504 may be calculated for each different predetermined direction.
- a common coupler 36 may be used for calculating the feature amount 504, or different couplers 36 may be used for different predetermined directions.
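- The two options described above, one aggregated calibration feature amount versus one per predetermined direction, differ mainly in how the input to the coupler 36 is assembled. The following Python sketch shows both input layouts; simple concatenation is an assumption, since the embodiment does not specify how the per-direction information is arranged, and the function names are hypothetical.

```python
def aggregate_coupler_input(features, true_values):
    """Single input vector covering all predetermined directions."""
    if len(features) != len(true_values):
        raise ValueError("one true value is required per feature amount")
    flat = []
    for feature, true_value in zip(features, true_values):
        flat.extend(feature)
        flat.extend(true_value)
    return flat


def per_direction_inputs(features, true_values):
    """One input vector per predetermined direction (common or separate couplers)."""
    return [list(f) + list(t) for f, t in zip(features, true_values)]


# Hypothetical feature amounts 5021 and true values for two directions.
features = [[0.2, 0.7], [0.9, 0.1]]
true_values = [(0.0, -10.0), (0.0, 10.0)]

joint = aggregate_coupler_input(features, true_values)
separate = per_direction_inputs(features, true_values)
```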
- the second acquisition unit 113 acquires a plurality of learning data sets 51 each composed of a combination of the learning target image 53 and the correct answer information 55.
- The second acquisition unit 113 may use at least one of the plurality of collected data sets 120 as the learning data set 51. That is, the second acquisition unit 113 may acquire the learning image 121 of the data set 120 as the learning target image 53 of the learning data set 51, and the correct answer information 123 of the data set 120 as the correct answer information 55 of the learning data set 51.
- the machine learning unit 114 inputs the learning target image 53 included in each acquired learning data set 51 into the extractor 31, and executes the arithmetic processing of the extractor 31.
- the machine learning unit 114 acquires the feature amount 54 related to the learning target image 53 from the extractor 31.
- The machine learning unit 114 inputs the feature amount 504 (learning calibration information 50) acquired from the coupler 36 and the acquired feature amount 54 into the estimator 32, and executes the arithmetic processing of the estimator 32.
- the machine learning unit 114 acquires an output value corresponding to the result of estimating the line-of-sight direction of the subject reflected in the learning target image 53 from the estimator 32.
- For each learning data set 51, the machine learning unit 114 trains the learning model 30 so that the output value obtained from the estimator 32, through the calculation of the feature amount 504 and the arithmetic processing of the estimation model 3, matches the corresponding correct answer information 55.
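- The quantity that such training drives toward zero can be sketched as an error between the outputs of the estimator and the correct answer information 55 over all learning data sets 51. The Python code below uses a mean squared error, which is an assumption (the embodiment does not name a loss function), and a trivial stand-in model; the dictionary keys and function names are hypothetical.

```python
def mean_squared_error(predictions, answers):
    """Average over data sets of the squared error summed over output components."""
    total = 0.0
    for prediction, answer in zip(predictions, answers):
        total += sum((p - a) ** 2 for p, a in zip(prediction, answer))
    return total / len(predictions)


def learning_loss(model, learning_data_sets, calibration_feature):
    """Error of `model` over the learning data sets for one calibration feature."""
    predictions = [model(ds["target_image"], calibration_feature) for ds in learning_data_sets]
    answers = [ds["correct_answer"] for ds in learning_data_sets]
    return mean_squared_error(predictions, answers)


def toy_model(target_image, calibration_feature):
    """Stand-in model: mean pixel value shifted by the calibration feature."""
    pixels = [v for row in target_image for v in row]
    mean_pixel = sum(pixels) / len(pixels)
    return (mean_pixel + calibration_feature[0], calibration_feature[1])


# Hypothetical learning data sets (target image, correct answer).
data_sets = [
    {"target_image": [[1.0, 3.0]], "correct_answer": (1.0, 5.0)},
    {"target_image": [[4.0, 4.0]], "correct_answer": (4.0, 5.0)},
]
loss = learning_loss(toy_model, data_sets, calibration_feature=(-1.0, 5.0))
```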
- the training of the learning model 30 may include training of each extractor (31, 35).
- Each extractor (31, 35) has already been trained to acquire the ability to extract, from an image, a feature amount containing components from which the line-of-sight direction of a person can be estimated. Therefore, in the training of the learning model 30, the training of each extractor (31, 35) may be omitted.
- Through this machine learning, the coupler 36 can acquire the ability to derive, by combining the feature information and the true value information, a calibration feature amount useful for estimating the line-of-sight direction of a person. Further, the estimator 32 can acquire the ability to appropriately estimate the line-of-sight direction of a person appearing in the corresponding image from the image feature amount obtained by the extractor 31 and the calibration feature amount obtained by the coupler 36.
- The learning reference image 501 and the learning true value information 503 used for calculating the feature amount 504 are preferably derived from the same subject as the learning data set 51 used in the training. That is, suppose that the learning reference image 501, the learning true value information 503, and the plurality of learning data sets 51 are acquired from each of a plurality of different subjects. In this case, the origin of each is preferably identified so that the learning reference image 501, the learning true value information 503, and the plurality of learning data sets 51 obtained from the same subject are used together for machine learning of the learning model 30. Each origin (that is, subject) may be identified by additional information such as an identifier.
- each data set 120 may further include this additional information.
- In that case, the subject from which each item is derived can be identified based on the additional information, so that the learning reference image 501, the learning true value information 503, and the plurality of learning data sets 51 obtained from the same subject can be used for machine learning of the learning model 30.
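- Keeping the calibration data and the learning data sets of the same subject together can be sketched as a grouping step keyed on an identifier. In the following Python code the key names (`subject_id`, `image`, `correct_answer`), the file names, and the rule for picking reference records are hypothetical.

```python
from collections import defaultdict


def group_by_subject(data_sets):
    """Map each subject identifier to the data sets obtained from that subject."""
    groups = defaultdict(list)
    for data_set in data_sets:
        groups[data_set["subject_id"]].append(data_set)
    return dict(groups)


def select_references(subject_data_sets, predetermined_directions):
    """Data sets whose correct answer is one of the predetermined directions."""
    return [ds for ds in subject_data_sets
            if ds["correct_answer"] in predetermined_directions]


# Hypothetical data sets 120 carrying an identifier as additional information.
data_sets = [
    {"subject_id": "A", "image": "a0.png", "correct_answer": (0.0, 0.0)},
    {"subject_id": "B", "image": "b0.png", "correct_answer": (0.0, 0.0)},
    {"subject_id": "A", "image": "a1.png", "correct_answer": (5.0, -10.0)},
]

groups = group_by_subject(data_sets)
references_a = select_references(groups["A"], [(0.0, 0.0)])
```

- Within each group, the selected records can serve as the learning reference image 501 and the learning true value information 503, while the records of the same group serve as the learning data sets 51.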
- The storage processing unit 115 generates information about the trained learning model 30 (that is, the trained extractor 31, the trained coupler 36, and the trained estimation model 3) as the learning result data 125. Then, the storage processing unit 115 stores the generated learning result data 125 in a predetermined storage area.
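- Storing and restoring learning result data can be sketched as serializing the arithmetic parameters of each component. The JSON format, the file name, and the parameter layout in the following Python code are assumptions; the embodiment does not specify a storage format.

```python
import json
import os
import tempfile


def save_learning_result(params, path):
    """Write the parameters of the trained components to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(params, f, indent=2)


def load_learning_result(path):
    """Read the parameters back so that the trained model can be rebuilt."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


# Hypothetical parameters for each trained component.
params = {
    "extractor": {"kernel": [[-1.0, 1.0], [-1.0, 1.0]]},
    "coupler": {"weights": [[-1.0, 0.0, 1.0, 0.0]], "bias": [0.0]},
    "estimator": {"weights": [[1.0, 1.0]], "bias": [0.0]},
}

with tempfile.TemporaryDirectory() as storage_area:
    path = os.path.join(storage_area, "learning_result.json")
    save_learning_result(params, path)
    restored = load_learning_result(path)
```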
- Each extractor (31, 35, 41), each estimator (32, 43), and the coupler 36 are composed of machine-learnable models having arithmetic parameters.
- the type of the machine learning model used for each is not particularly limited as long as each arithmetic processing can be executed, and may be appropriately selected according to the embodiment.
- a convolutional neural network is used for each extractor (31, 35, 41).
- a fully connected neural network is used for each estimator (32, 43) and the coupler 36.
- each extractor (31, 35, 41) includes a convolution layer (311, 351 and 411) and a pooling layer (312, 352, 412).
- the convolution layers (311 and 351 and 411) are configured to perform convolution operations on given data.
- the convolution operation corresponds to a process of calculating the correlation between a given data and a predetermined filter. For example, by convolving an image, a shading pattern similar to the shading pattern of the filter can be detected from the input image.
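- The correlation described above can be sketched in plain Python. The function name `correlate2d`, the toy image, and the edge filter are illustrative assumptions, not values from the embodiment:

```python
def correlate2d(image, kernel):
    """Slide the filter over the image and record the correlation at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = [[0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            out[i][j] = sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh)
                for dj in range(kw)
            )
    return out


# Toy image: dark on the left, bright on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Filter whose shading pattern is a dark-to-bright vertical edge.
edge_filter = [
    [-1, 1],
    [-1, 1],
]
response = correlate2d(image, edge_filter)
```

- The response is largest in the column where the image contains the same dark-to-bright pattern as the filter and zero elsewhere.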
- the convolution layer (311, 351, 411) includes neurons corresponding to this convolution operation, each of which connects to a part of the input or to a part of the output of the layer placed before (on the input side of) its own layer.
- the pooling layer (312, 352, 412) is configured to perform a pooling process.
- the pooling process discards a part of the information on the position where the response of the given data to the filter is strong, and realizes the invariance of the response to the minute position change of the feature appearing in the data. For example, in the pooling process, the largest value in the filter may be extracted and the other values may be deleted.
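- A minimal sketch of max pooling, assuming a non-overlapping 2x2 window (the window size and the function name `max_pool2d` are assumptions for illustration), shows how a small position change leaves the pooled output unchanged:

```python
def max_pool2d(feature_map, size=2):
    """Keep only the largest value in each non-overlapping size x size window."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            row.append(max(
                feature_map[i + di][j + dj]
                for di in range(size)
                for dj in range(size)
            ))
        pooled.append(row)
    return pooled


# Two response maps whose strong responses differ by a one-pixel shift.
original = [
    [0, 9, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 5],
]
shifted = [
    [0, 0, 0, 0],
    [9, 0, 0, 0],
    [0, 0, 5, 0],
    [0, 0, 0, 0],
]
pooled_original = max_pool2d(original)
pooled_shifted = max_pool2d(shifted)
```

- Both maps pool to the same 2x2 result because the exact position inside each window is discarded.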
- the number of convolution layers (311, 351, 411) and pooling layers (312, 352, 412) contained in each extractor (31, 35, 41) is not particularly limited and may be appropriately determined according to the embodiment.
- a convolution layer (311 and 351 and 411) is arranged on the most input side (left side of the figure), and this convolution layer (311 and 351 and 411) constitutes an input layer.
- a pooling layer (312, 352, 412) is arranged on the most output side (right side in the figure), and this pooling layer (312, 352, 412) constitutes an output layer.
- the structure of each extractor (31, 35, 41) does not have to be limited to such an example.
- the arrangement of the convolution layers (311 and 351 and 411) and the pooling layers (312, 352, 412) may be appropriately determined according to the embodiment.
- the convolutional layers (311 and 351 and 411) and the pooling layers (312, 352, 412) may be arranged alternately.
- one or more pooling layers (312, 352, 412) may be arranged after the plurality of convolution layers (311 and 351 and 411) are arranged consecutively.
- the types of layers included in each extractor (31, 35, 41) are not limited to the convolution layer and the pooling layer.
- Each extractor (31, 35, 41) may include other types of layers such as, for example, a normalized layer, a dropout layer, a fully connected layer, and the like.
- the structure of each extractor (31, 35) is derived from the structure of the extractor 41 used for each.
- the structures may or may not match between the extractor 31 and the extractor 35.
- the structures of the extractors 35 prepared in each predetermined direction may be the same.
- the structure of at least some of the extractors 35 may be different from that of the other extractors 35.
- each estimator (32, 43) and the coupler 36 include one or more fully connected layers (321, 431, 361).
- the number of fully connected layers (321, 431, 361) included in each estimator (32, 43) and the coupler 36 is not particularly limited and may be appropriately determined according to the embodiment.
- the fully connected layer arranged on the input side constitutes the input layer
- the fully connected layer arranged on the output side constitutes the output layer.
- the fully connected layer arranged between the input layer and the output layer constitutes an intermediate (hidden) layer.
- when only one fully connected layer is included, that one fully connected layer operates as both the input layer and the output layer.
- Each fully connected layer (321, 431, 361) comprises one or more neurons (nodes).
- the number of neurons (nodes) contained in each fully connected layer (321, 431, 361) is not particularly limited and may be appropriately selected depending on the embodiment.
- the number of neurons included in the input layer may be determined according to, for example, the input data such as the feature amount and the true value information and the format thereof. Further, the number of neurons included in the output layer may be determined according to, for example, the output data such as the feature amount and the estimation result and the format thereof.
- Each neuron contained in each fully connected layer (321, 431, 361) is connected to all neurons in the adjacent layer. However, the connection relationship of each neuron is not limited to such an example, and may be appropriately determined according to the embodiment.
- Weights are set for each connection of the convolution layers (311, 351, 411) and the fully connected layers (321, 431, 361).
- a threshold is set for each neuron, and basically, the output of each neuron is determined by whether or not the sum of the products of each input and each weight exceeds the threshold.
- the threshold value may be expressed by an activation function. In this case, the output of each neuron is determined by inputting the sum of the products of each input and each weight into the activation function and executing the operation of the activation function.
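- The two formulations can be compared in a short sketch. The sigmoid is used only as one possible activation function, and the inputs, weights, and function names are illustrative assumptions:

```python
import math


def weighted_sum(inputs, weights):
    """Sum of the products of each input and its connection weight."""
    return sum(x * w for x, w in zip(inputs, weights))


def threshold_neuron(inputs, weights, threshold):
    """Fires (1.0) only when the weighted sum exceeds the threshold."""
    return 1.0 if weighted_sum(inputs, weights) > threshold else 0.0


def sigmoid_neuron(inputs, weights, threshold):
    """Same neuron with the hard threshold replaced by a sigmoid activation."""
    z = weighted_sum(inputs, weights) - threshold
    return 1.0 / (1.0 + math.exp(-z))


inputs, weights = [1.0, 0.5], [0.4, 0.8]          # weighted sum is 0.8
fires = threshold_neuron(inputs, weights, 0.5)    # 0.8 exceeds 0.5
silent = threshold_neuron(inputs, weights, 1.0)   # 0.8 does not exceed 1.0
soft = sigmoid_neuron(inputs, weights, 0.5)       # smooth value between 0 and 1
```

- A differentiable activation such as the sigmoid is what allows the error gradients used later in training to be computed.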
- the type of activation function may be arbitrarily selected.
- the weights of the connections between neurons and the threshold values of the neurons contained in the convolution layers (311, 351, 411) and the fully connected layers (321, 431, 361) are examples of the calculation parameters used for the arithmetic processing of each extractor (31, 35, 41), each estimator (32, 43), and the coupler 36.
- the input and output data formats of the extractors (31, 35, 41), the estimators (32, 43), and the coupler 36 are not particularly limited and may be appropriately determined according to the embodiment.
- the output layer of each estimator (32, 43) may be configured to directly output (eg, regress) the estimation result.
- alternatively, the output layer of each estimator (32, 43) may be configured to output the estimation result indirectly, for example, by providing one or more neurons for each class to be identified and outputting from each neuron the probability of belonging to the corresponding class.
- the input layers of the extractors (31, 35, 41), the estimators (32, 43), and the coupler 36 may be configured to further accept input of data other than the input data described above, such as the reference image, the target image, the feature amount, and the true value information. Any preprocessing may be applied to the input data before it is input to the input layer.
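- To illustrate the indirect form of output, the sketch below converts per-class output values into probabilities with a softmax and picks the most probable class. The three direction classes and the softmax itself are assumptions for illustration; the text does not fix either:

```python
import math

# Hypothetical discretisation of the line-of-sight direction into classes.
CLASSES = ["left", "center", "right"]


def softmax(values):
    """Turn raw per-class output values into probabilities that sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]  # subtract peak for stability
    total = sum(exps)
    return [e / total for e in exps]


def decode_class(raw_outputs):
    """Return the most probable class together with all class probabilities."""
    probs = softmax(raw_outputs)
    best = max(range(len(probs)), key=probs.__getitem__)
    return CLASSES[best], probs


label, probs = decode_class([0.2, 2.0, -1.0])
```

- In the direct (regression) form, the output neurons would instead hold the direction itself, for example a pair of angles, and no decoding step would be needed.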
- in the machine learning of the learning model 4, the machine learning unit 114 repeatedly adjusts the values of the calculation parameters of the extractor 41 and the estimator 43 so that, for each data set 120, the error between the output value obtained from the estimator 43 by the arithmetic processing and the correct answer information 123 becomes small. This makes it possible to generate a trained extractor 41. Further, in the machine learning of the learning model 30, the machine learning unit 114 repeatedly adjusts the values of the calculation parameters of each extractor (31, 35), the coupler 36, and the estimator 32 so that, for the learning reference image 501, the learning true value information 503, and each learning data set 51, the error between the output value obtained from the estimator 32 by the above arithmetic processing and the correct answer information 55 becomes small.
- the storage processing unit 115 generates learning result data 125 for reproducing the trained estimation model 3 (extractor 31 and estimator 32), the trained extractor 35, and the trained coupler 36 generated by machine learning.
- the configuration of the learning result data 125 may be arbitrary as long as each can be reproduced.
- the storage processing unit 115 generates information indicating the values of the calculation parameters of the generated trained estimation model 3, the trained extractor 35, and the trained coupler 36 as the learning result data 125.
- the learning result data 125 may further include information indicating each structure.
- the structure may be specified, for example, by the number of layers from the input layer to the output layer in the neural network, the type of each layer, the number of neurons included in each layer, the connection relationship between neurons in adjacent layers, and the like.
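- One possible layout for such data is sketched below as a JSON-serializable dictionary. All key names, layer sizes, and parameter values are hypothetical; the text only requires that each model can be reproduced from the stored information:

```python
import json

# Hypothetical learning result data: structure plus calculation parameters.
learning_result_data = {
    "extractor_35": {
        "structure": [
            {"type": "convolution", "filters": 2, "kernel_size": 3},
            {"type": "pooling", "size": 2},
        ],
        "parameters": {
            "conv0_weights": [0.1, -0.2],
            "conv0_thresholds": [0.0, 0.05],
        },
    },
    "coupler_36": {
        "structure": [{"type": "fully_connected", "neurons": 4}],
        "parameters": {
            "fc0_weights": [0.3, 0.4, -0.1, 0.2],
            "fc0_thresholds": [0.0, 0.0, 0.0, 0.0],
        },
    },
}

# Store as text, then restore; the restored object describes the same models.
serialized = json.dumps(learning_result_data, indent=2)
restored = json.loads(serialized)
```

- Keeping the structure next to the parameter values lets a device that does not share the structure in advance rebuild each network before loading its values.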
- the storage processing unit 115 stores the generated learning result data 125 in a predetermined storage area.
- the machine learning results of each extractor (31, 35), estimator 32, and coupler 36 are stored as one learning result data 125 .
- the storage format of the learning result data 125 does not have to be limited to such an example.
- the machine learning results of each extractor (31, 35), estimator 32 and coupler 36 may be stored as separate data.
- <Gaze estimation device> FIGS. 5A and 5B schematically illustrate an example of the software configuration of the line-of-sight estimation device 2 according to the present embodiment.
- the control unit 21 of the line-of-sight estimation device 2 expands the line-of-sight estimation program 82 stored in the storage unit 22 into the RAM. Then, the control unit 21 controls each component by interpreting and executing the instruction included in the line-of-sight estimation program 82 expanded in the RAM by the CPU.
- the line-of-sight estimation device 2 according to the present embodiment operates as a computer including an information acquisition unit 211, an image acquisition unit 212, an estimation unit 213, and an output unit 214 as software modules. That is, in the present embodiment, each software module of the line-of-sight estimation device 2 is realized by the control unit 21 (CPU) in the same manner as the model generation device 1.
- the information acquisition unit 211 acquires calibration information 60 including feature information 602 regarding the line of sight of the eyes of the subject R looking in a predetermined direction and true value information 603 indicating the true value of the predetermined direction in which the eyes of the subject R are looking.
- the information acquisition unit 211 has the learned extractor 35 and the coupler 36 by holding the learning result data 125.
- the information acquisition unit 211 acquires the reference image 601 in which the eyes of the subject R looking in a predetermined direction are captured.
- the information acquisition unit 211 inputs the acquired reference image to the learned extractor 35, and executes arithmetic processing of the extractor 35.
- the information acquisition unit 211 acquires the output value corresponding to the feature amount 6021 related to the reference image 601 from the extractor 35.
- the feature amount 6021 is an example of the second feature amount.
- the feature information 602 is composed of the feature amount 6021.
- the information acquisition unit 211 acquires the true value information 603. Then, the information acquisition unit 211 inputs the acquired feature amount 6021 and the true value information 603 into the learned coupler 36, and executes the arithmetic processing of the coupler 36. As a result, the information acquisition unit 211 acquires the output value corresponding to the feature amount 604 related to the calibration derived by combining the feature information 602 and the true value information 603 from the coupler 36.
- the feature amount 604 is an example of the calibration feature amount.
- the calibration information 60 is composed of the feature amount 604.
- the information acquisition unit 211 can acquire the calibration information 60 (feature amount 604) by using the learned extractor 35 and the coupler 36 by these arithmetic processes.
- the calibration information 60 may include feature information 602 and true value information 603 corresponding to each of a plurality of different predetermined directions in response to the generation process of the trained estimation model 3.
- the information acquisition unit 211 may acquire the reference image 601 and the true value information 603 for each of the plurality of different predetermined directions.
- the information acquisition unit 211 may acquire each feature amount 6021 from the extractor 35 by inputting each reference image 601 into the learned extractor 35 and executing the arithmetic processing of the extractor 35. Subsequently, the information acquisition unit 211 may input the acquired feature amount 6021 and the true value information 603 for each predetermined direction into the learned coupler 36, and execute the arithmetic processing of the coupler 36.
- the information acquisition unit 211 may acquire the feature amount 604 related to the calibration from the coupler 36.
- the feature amount 604 may include information in which the feature information 602 and the true value information 603 for each of the plurality of different predetermined directions are combined.
- the method of acquiring the feature amount 604 does not have to be limited to such an example.
- the feature amount 604 may be calculated for each different predetermined direction corresponding to the generation process. In this case, a common coupler 36 may be used for the calculation of the feature amount 604, or different couplers 36 may be used for different predetermined directions.
- the image acquisition unit 212 acquires the target image 63 in which the eyes of the target person R are captured.
- the estimation unit 213 has the trained estimation model 3 generated by machine learning by holding the learning result data 125.
- the estimation unit 213 uses the trained estimation model 3 to estimate the line-of-sight direction of the eyes of the target person R reflected in the target image 63.
- the estimation unit 213 inputs the acquired target image 63 and the calibration information 60 into the trained estimation model 3, and executes the arithmetic processing of the trained estimation model 3.
- the estimation unit 213 acquires an output value corresponding to the result of estimating the eye-gaze direction of the target person R reflected in the target image 63 from the trained estimation model 3.
- the arithmetic processing of the trained estimation model 3 may be appropriately determined according to the configuration of the trained estimation model 3.
- the trained estimation model 3 includes a trained extractor 31 and an estimator 32.
- the estimation unit 213 inputs the acquired target image 63 into the learned extractor 31, and executes the arithmetic processing of the extractor 31.
- the estimation unit 213 acquires an output value corresponding to the feature amount 64 related to the target image 63 from the extractor 31.
- the feature amount 64 is an example of the first feature amount.
- the feature amount 6021 and the feature amount 64 may be read as image feature amounts, respectively.
- the estimation unit 213 inputs the feature amount 604 acquired by the information acquisition unit 211 and the feature amount 64 acquired from the extractor 31 into the estimator 32, and executes the arithmetic processing of the estimator 32.
- the arithmetic processing of the trained estimation model 3 consists of executing the arithmetic processing of the extractor 31 and the estimator 32.
- the estimation unit 213 can obtain an output value corresponding to the result of estimating the eye-gaze direction of the target person R reflected in the target image 63 from the estimator 32.
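- The data flow through the information acquisition unit 211 and the estimation unit 213 can be sketched as follows. The function `estimate_gaze` and the three toy stand-ins are assumptions for illustration and are not trained models; in the embodiment each callable would be a trained neural network, and the gaze is reduced here to a single number:

```python
from typing import Callable, List, Sequence

Vector = List[float]


def estimate_gaze(
    target_image: Sequence[float],
    reference_images: Sequence[Sequence[float]],
    true_values: Sequence[Vector],
    extractor_31: Callable[[Sequence[float]], Vector],
    extractor_35: Callable[[Sequence[float]], Vector],
    coupler_36: Callable[[Vector], Vector],
    estimator_32: Callable[[Vector], Vector],
) -> Vector:
    # Unit 211: feature amount 6021 of each reference image 601, joined
    # with the true value information 603 of the same predetermined direction.
    coupler_input: Vector = []
    for ref, truth in zip(reference_images, true_values):
        coupler_input += extractor_35(ref) + list(truth)
    calibration = coupler_36(coupler_input)       # feature amount 604

    # Unit 213: feature amount 64 of the target image 63, then the estimator.
    feature = extractor_31(target_image)          # feature amount 64
    return estimator_32(feature + calibration)


# Toy stand-ins (assumptions): a one-value "image feature", a coupler that
# averages (true value - feature), and an estimator that adds the two.
def mean_extractor(image: Sequence[float]) -> Vector:
    return [sum(image) / len(image)]


def offset_coupler(v: Vector) -> Vector:
    pairs = list(zip(v[0::2], v[1::2]))           # (feature, true value) pairs
    return [sum(t - f for f, t in pairs) / len(pairs)]


def sum_estimator(v: Vector) -> Vector:
    return [v[0] + v[1]]


gaze = estimate_gaze(
    [0.4, 0.4], [[0.2, 0.2]], [[0.5]],
    mean_extractor, mean_extractor, offset_coupler, sum_estimator,
)
```

- Because the calibration feature depends only on the reference images and true values, it can be computed once per subject and reused for every target image.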
- the output unit 214 outputs information regarding the result of estimating the line-of-sight direction of the subject R.
- each software module of the model generation device 1 and the line-of-sight estimation device 2 will be described in detail in an operation example described later.
- an example in which each software module of the model generation device 1 and the line-of-sight estimation device 2 is realized by a general-purpose CPU is described.
- some or all of the above software modules may be implemented by one or more dedicated processors. That is, each of the above modules may be realized as a hardware module.
- software modules may be omitted, replaced, or added as appropriate according to the embodiment.
- FIG. 6 is a flowchart showing an example of the processing procedure of the model generation device 1 according to the present embodiment.
- the processing procedure described below is an example of a model generation method. However, the processing procedure described below is only an example, and each step may be changed as much as possible. Further, with respect to the processing procedure described below, steps can be omitted, replaced, and added as appropriate according to the embodiment.
- In step S101, the control unit 11 operates as the collection unit 111 and collects a plurality of learning data sets 120 from subjects.
- Each data set 120 is composed of a combination of a learning image 121 in which the subject's eyes are captured and correct answer information 123 indicating the true value of the subject's line-of-sight direction captured in the learning image 121.
- Each data set 120 may be generated as appropriate.
- a camera S or a camera of the same type and a subject are prepared.
- the number of subjects may be determined as appropriate.
- the subject is instructed to look in various directions, and the face of the subject looking in the instructed direction is photographed by the camera.
- the learning image 121 can be acquired.
- the learning image 121 may be an image as it is obtained by the camera.
- the learning image 121 may be generated by applying some kind of image processing to the image obtained by the camera.
- Information indicating the true value of the line-of-sight direction instructed to the subject is associated with the acquired learning image 121 as correct answer information 123.
- each data set 120 can be generated.
- as the method for acquiring the learning image 121 and the correct answer information 123, the same method as the method for acquiring the reference image 601 and the true value information 603 (FIG. 8), which will be described later, may be adopted.
- Each data set 120 may be automatically generated by the operation of a computer, or may be manually generated by at least partially including an operator's operation. Further, each data set 120 may be generated by the model generation device 1 or by a computer other than the model generation device 1.
- when each data set 120 is generated by the model generation device 1, the control unit 11 acquires the plurality of data sets 120 by executing the generation process automatically, or manually through the operator's operation via the input device 15.
- when each data set 120 is generated by another computer, the control unit 11 acquires the plurality of data sets 120 generated by the other computer via, for example, a network, the storage medium 91, or the like.
- Some datasets 120 may be generated by the model generator 1 and other datasets 120 may be generated by one or more other computers.
- the number of data sets 120 to be acquired is not particularly limited and may be appropriately determined according to the embodiment.
- the control unit 11 proceeds to the next step S102.
- In step S102, the control unit 11 operates as the machine learning unit 114 and uses the collected data sets 120 to perform machine learning of the learning model 4.
- the extractor 41 and the estimator 43 are trained so that, for each data set 120, the output value (the estimation result of the line-of-sight direction) obtained from the estimator 43 by inputting the learning image 121 to the extractor 41 matches the corresponding correct answer information 123. It should be noted that not all the collected data sets 120 must be used for machine learning of the learning model 4.
- the data set 120 used for machine learning of the learning model 4 may be appropriately selected.
- the control unit 11 prepares a neural network that constitutes each of the extractor 41 and the estimator 43 to be processed by machine learning.
- the structure of each neural network (for example, the number of layers, the type of each layer, the number of neurons contained in each layer, the connection relationship between neurons in adjacent layers, etc.) and the initial value of the connection weight between neurons may be given by a template or by the input of the operator.
- the control unit 11 may prepare the extractor 41 and the estimator 43 based on the learning result data obtained by the past machine learning.
- the control unit 11 performs the training process of the extractor 41 and the estimator 43, using the learning image 121 of each data set 120 as training data (input data) and the correct answer information 123 as teacher data (teacher signal, label).
- a stochastic gradient descent method, a mini-batch gradient descent method, or the like may be used.
- the control unit 11 inputs the learning image 121 to the extractor 41 and executes the arithmetic processing of the extractor 41. That is, the control unit 11 inputs the learning image 121 to the input layer of the extractor 41 (in the example of FIG. 4A, the convolution layer 411 arranged closest to the input side) and executes the forward-propagation arithmetic processing of each layer (411, 412), such as determining the firing of neurons in order from the input side.
- the control unit 11 acquires the output value corresponding to the feature amount extracted from the learning image 121 from the output layer of the extractor 41 (in the example of FIG. 4A, the pooling layer 412 arranged closest to the output side).
- the control unit 11 inputs the obtained output value (feature amount) to the input layer of the estimator 43 (the fully connected layer 431 arranged closest to the input side) and, in the same manner as the arithmetic processing of the extractor 41, executes the forward-propagation arithmetic processing of the estimator 43. By this arithmetic processing, the control unit 11 acquires, from the output layer of the estimator 43 (the fully connected layer 431 arranged closest to the output side), the output value corresponding to the result of estimating the line-of-sight direction of the subject captured in the learning image 121.
- the control unit 11 calculates an error between the output value obtained from the output layer of the estimator 43 and the correct answer information 123.
- a loss function may be used to calculate the error (loss).
- the loss function is a function that evaluates the difference (that is, the degree of difference) between the output of the machine learning model and the correct answer: the larger the difference between the output value obtained from the output layer and the correct answer, the larger the error value calculated by the loss function.
- the type of loss function used for calculating the error does not have to be particularly limited, and may be appropriately selected depending on the embodiment.
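- Two candidate loss functions are sketched below, assuming the line-of-sight direction is represented either as a tuple of angles or as a 3D direction vector. Both representations and both function names are assumptions; the text leaves the choice open:

```python
import math


def mse_loss(pred, target):
    """Mean squared error between predicted and true values."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def angular_error_deg(pred, target):
    """Angle in degrees between two gaze direction vectors."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(t * t for t in target))
    cosine = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.degrees(math.acos(cosine))


exact = mse_loss([1.0, 2.0], [1.0, 2.0])
near = mse_loss([1.0, 2.0], [1.5, 2.0])
far = mse_loss([0.0, 0.0], [3.0, 4.0])
right_angle = angular_error_deg([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
```

- In both cases the returned value grows as the output moves away from the correct answer, which is the property the text requires of a loss function.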
- the control unit 11 uses the gradient of the error of the output value, calculated by the back-propagation method, to calculate the error in the value of each calculation parameter of the extractor 41 and the estimator 43 (the weight of each connection between neurons, the threshold value of each neuron, etc.) in order from the output side.
- the control unit 11 updates the values of the calculation parameters of the extractor 41 and the estimator 43 based on the calculated errors.
- the degree to which the value of each calculation parameter is updated may be adjusted by the learning rate.
- the learning rate may be given by the operator's designation, or may be given as a set value in the program.
- the control unit 11 adjusts the values of the calculation parameters of the extractor 41 and the estimator 43 by the above series of update processes so that the sum of the errors of the calculated output values becomes small for each data set 120. For example, the control unit 11 may repeat the adjustment of the values of the calculation parameters of the extractor 41 and the estimator 43 by the series of update processes until a predetermined condition is satisfied, such as the processing having been executed a specified number of times or the sum of the calculated errors being equal to or less than a threshold value.
- the control unit 11 can generate a trained learning model 4 that has acquired the ability to appropriately estimate the line-of-sight direction of the subject captured in the learning image 121 for each data set 120. Further, the output (that is, the feature amount) of the trained extractor 41 comes to include a component related to the subject's eyes included in the learning image 121 so that the estimator 43 can appropriately estimate the line-of-sight direction of the subject.
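- The update loop and its two stopping conditions can be sketched on a deliberately tiny model. A linear model with two parameters stands in for the extractor 41 and the estimator 43, and the data, learning rate, and thresholds are assumptions for illustration. Samples are visited in a fixed order here; a stochastic variant would shuffle them each pass:

```python
def train(data, learning_rate=0.1, max_epochs=1000, error_threshold=1e-6):
    """Per-sample gradient descent on squared error for y = w * x + b."""
    w, b = 0.0, 0.0
    epoch, total_error = 0, float("inf")
    for epoch in range(1, max_epochs + 1):     # condition 1: specified count
        total_error = 0.0
        for x, y in data:
            error = (w * x + b) - y            # output minus correct answer
            total_error += error * error
            # Gradient of error**2 with respect to each parameter,
            # scaled by the learning rate.
            w -= learning_rate * 2.0 * error * x
            b -= learning_rate * 2.0 * error
        if total_error <= error_threshold:     # condition 2: small error sum
            break
    return w, b, epoch, total_error


# Toy data generated from y = 2 * x + 1.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
w, b, epochs, total_error = train(data)
```

- On this data the loop stops on the error condition well before the maximum count, with `w` and `b` close to 2 and 1.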
- the control unit 11 proceeds to the next step S103.
- In step S103, the control unit 11 prepares the learning model 30 including the estimation model 3 by using the learning result of the extractor 41.
- the control unit 11 prepares each extractor (31, 35) based on the learning result of the extractor 41. That is, the control unit 11 uses the trained extractor 41 generated in step S102 or a copy thereof as each extractor (31, 35).
- in step S102, the control unit 11 may instead prepare separate learning models 4 and carry out machine learning of each. Then, the control unit 11 may use the trained extractors 41 generated by the respective machine learning, or copies thereof, for the respective extractors (31, 35).
- the control unit 11 prepares a neural network that constitutes each of the estimator 32 and the coupler 36. Similar to the extractor 41 and the like, the structure of the neural network constituting each of the estimator 32 and the coupler 36, the initial value of the weight of the connection between neurons, and the initial value of the threshold value of each neuron may be given by a template or by the input of the operator. Further, when performing re-learning, the control unit 11 may prepare the estimator 32 and the coupler 36 based on the learning result data obtained by the past machine learning. When the learning model 30 including each extractor (31, 35), the estimator 32, and the coupler 36 is prepared, the control unit 11 proceeds to the next step S104.
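- Preparing the extractors (31, 35) as copies of the trained extractor 41 can be sketched as below. The parameters are held in a plain dictionary with made-up names and values, which is an assumption for illustration only:

```python
import copy

# Hypothetical calculation parameters of the trained extractor 41.
trained_extractor_41 = {
    "conv_weights": [[0.1, -0.2], [0.3, 0.4]],
    "conv_thresholds": [0.05],
}

# Each extractor starts from an independent copy of the trained parameters.
extractor_31 = copy.deepcopy(trained_extractor_41)
extractor_35 = copy.deepcopy(trained_extractor_41)

# A later update to one extractor leaves the other copies untouched.
extractor_31["conv_weights"][0][0] = 9.9
```

- Using independent copies lets the extractor 31 and the extractor 35 diverge during the subsequent machine learning; using the same object for both would instead keep their parameters tied.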
- In step S104, the control unit 11 operates as the first acquisition unit 112 and acquires the learning calibration information 50 including the learning feature information 502 and the learning true value information 503.
- the control unit 11 acquires the learning calibration information 50 by using the extractor 35 and the coupler 36. Specifically, the control unit 11 first acquires the learning reference image 501, in which the eyes of the subject looking in a predetermined direction are captured, and the learning true value information 503, which indicates the true value of the predetermined direction (line-of-sight direction) in which the subject captured in the learning reference image 501 is looking. The control unit 11 may acquire, as the learning reference image 501, the learning image 121 included in the data set 120 obtained for the subject looking in the predetermined direction, and may acquire the corresponding correct answer information 123 as the learning true value information 503. Alternatively, the control unit 11 may acquire the learning reference image 501 and the learning true value information 503 separately from the data set 120. The method of acquiring the learning reference image 501 and the learning true value information 503 may be the same as the method of generating the data set 120.
- the control unit 11 inputs the acquired learning reference image 501 to the input layer of the extractor 35 (in the example of FIG. 4B, the convolution layer 351 arranged closest to the input side) and executes the forward-propagation arithmetic processing of the extractor 35. By this arithmetic processing, the control unit 11 acquires the output value corresponding to the feature amount 5021 (learning feature information 502) related to the learning reference image 501 from the output layer of the extractor 35 (in the example of FIG. 4B, the pooling layer 352 arranged closest to the output side).
- the control unit 11 inputs the acquired feature amount 5021 and the learning true value information 503 to the input layer of the coupler 36 (the fully connected layer 361 arranged closest to the input side) and executes the forward-propagation arithmetic processing of the coupler 36. By this arithmetic processing, the control unit 11 acquires the output value corresponding to the feature amount 504 related to calibration from the output layer of the coupler 36 (the fully connected layer 361 arranged closest to the output side).
- the control unit 11 can acquire the learning calibration information 50 composed of the feature amount 504 by using the extractor 35 and the coupler 36 through these arithmetic processes. As described above, when a plurality of predetermined directions are set, the control unit 11 may acquire the learning reference image 501 and the learning true value information 503 for each of the plurality of different predetermined directions. Then, by executing the arithmetic processing of the extractor 35 and the coupler 36 for each, the control unit 11 may acquire learning calibration information 50 that includes the learning feature information 502 and the learning true value information 503 for each of the plurality of different predetermined directions. When the learning calibration information 50 is acquired, the control unit 11 proceeds to the next step S105.
- In step S105, the control unit 11 operates as the second acquisition unit 113 and acquires a plurality of learning data sets 51, each composed of a combination of the learning target image 53 and the correct answer information 55.
- the control unit 11 may use at least one of the collected plurality of data sets 120 as the learning data set 51. That is, the control unit 11 may acquire the learning image 121 of the data set 120 as the learning target image 53 of the learning data set 51, and the correct answer information 123 of the data set 120 as the correct answer information 55 of the learning data set 51.
- the control unit 11 may acquire each learning data set 51 separately from the data set 120.
- the method of acquiring each training data set 51 may be the same as the method of generating the data set 120.
- the number of learning data sets 51 to be acquired is not particularly limited, and may be appropriately determined according to the embodiment.
- the control unit 11 proceeds to the next step S106.
- the timing of executing the process of step S105 does not have to be limited to such an example.
- the process of step S105 may be executed at an arbitrary timing before the process of step S106 described later is executed.
- In step S106, the control unit 11 operates as the machine learning unit 114 and uses the acquired plurality of learning data sets 51 to perform machine learning of the estimation model 3.
- the control unit 11 trains the estimation model 3 so that, for each learning data set 51, it outputs an output value that matches the corresponding correct answer information 55 in response to the input of the learning target image 53 and the learning calibration information 50.
- the control unit 11 executes the training process of the learning model 30 including the estimation model 3, using the learning target image 53 of each learning data set 51, the learning reference image 501, and the learning true value information 503 as training data, and the correct answer information 55 of each learning data set 51 as teacher data.
- a stochastic gradient descent method, a mini-batch gradient descent method, or the like may be used.
- the control unit 11 inputs the learning target image 53 included in each learning data set 51 to the input layer of the extractor 31 (in the example of FIG. 4B, the convolution layer 311 arranged closest to the input side) and executes the forward-propagation arithmetic processing of the extractor 31.
- the control unit 11 acquires the output value corresponding to the feature amount 54 extracted from the learning target image 53 from the output layer of the extractor 31 (in the example of FIG. 4B, the pooling layer 312 arranged closest to the output side).
- the control unit 11 inputs the feature amount 504 obtained from the coupler 36 and the feature amount 54 obtained from the extractor 31 to the input layer of the estimator 32 (the fully connected layer 321 arranged closest to the input side) and executes the forward-propagation arithmetic processing of the estimator 32. By this arithmetic processing, the control unit 11 acquires, from the output layer of the estimator 32 (the fully connected layer 321 arranged closest to the output side), the output value corresponding to the result of estimating the line-of-sight direction of the subject captured in the learning target image 53.
- the control unit 11 calculates the error between the output value obtained from the output layer of the estimator 32 and the corresponding correct answer information 55. Similar to the machine learning of the learning model 4, an arbitrary loss function may be used to calculate the error.
- the control unit 11 uses the gradient of the error of the calculated output value and, by the error back-propagation method, calculates the error of the value of each calculation parameter of each extractor (31, 35), the coupler 36, and the estimator 32 in order from the output side.
- the control unit 11 updates the values of the calculation parameters of the extractors (31, 35), the coupler 36, and the estimator 32 based on the calculated errors. Similar to the machine learning of the learning model 4, the degree to which the value of each calculation parameter is updated may be adjusted by the learning rate.
- the control unit 11 executes this series of update processes together with the calculation of the feature amount 504 in step S104 and the arithmetic processing of the estimation model 3. As a result, the control unit 11 adjusts the values of the calculation parameters of each extractor (31, 35), the coupler 36, and the estimator 32 so that the sum of the errors of the output values calculated for the learning reference image 501, the learning true value information 503, and each learning data set 51 becomes small. As with the machine learning of the learning model 4, the control unit 11 may repeat this adjustment of the values of the calculation parameters of each extractor (31, 35), the coupler 36, and the estimator 32 by the above series of update processes until a predetermined condition is satisfied.
- the subject from which each item is derived may be identified so that the learning reference image 501, the learning true value information 503, and the plurality of learning data sets 51 obtained from the same subject are used together for machine learning of the learning model 30.
- each extractor (31, 35) is trained to acquire the ability to extract a feature amount including a component capable of estimating the line-of-sight direction of a person from an image by machine learning of the learning model 4. Therefore, in the above update process, the process of adjusting the value of each calculation parameter of each extractor (31, 35) may be omitted.
- the process of step S104 may be executed at an arbitrary timing before executing the arithmetic process of the estimator 32. For example, the process of step S104 may be executed after the arithmetic process of the extractor 31 is executed.
- as a result of this machine learning, the control unit 11 can generate, for each learning data set 51, a trained learning model 30 that has acquired the ability to appropriately estimate the line-of-sight direction of a person from the learning reference image 501, the learning true value information 503, and the learning target image 53. That is, the control unit 11 can generate a trained coupler 36 that has acquired the ability to derive, for each learning data set 51, a calibration feature amount useful for estimating the line-of-sight direction of a person. Further, the control unit 11 can generate a trained estimator 32 that has acquired the ability, for each learning data set 51, to appropriately estimate the line-of-sight direction of the person appearing in the corresponding image from the feature amount of the image obtained by the extractor 31 and the calibration feature amount obtained by the coupler 36.
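The update cycle of step S106 (forward propagation, error calculation, gradient calculation, parameter update scaled by a learning rate, repetition) can be illustrated with a deliberately small sketch. Everything in it is an assumption made for illustration: the estimator is replaced by a single linear layer, the extractor outputs are given directly as short feature lists, the loss is a squared error, and the data are synthetic. It is not the network of FIG. 4B.

```python
def forward(weights, bias, calib_feat, target_feat):
    """Stand-in for the estimator: one linear layer over the concatenated
    calibration feature and target-image feature."""
    x = list(calib_feat) + list(target_feat)
    y = sum(w * v for w, v in zip(weights, x)) + bias
    return y, x


def train(datasets, calib_feat, lr=0.05, epochs=200):
    """Per-sample gradient descent on a squared-error loss.

    datasets: list of (target_feature, correct_answer) pairs (synthetic).
    Returns the learned parameters and the summed error of every epoch.
    """
    n_in = len(calib_feat) + len(datasets[0][0])
    weights, bias = [0.0] * n_in, 0.0
    history = []
    for _ in range(epochs):
        total = 0.0
        for target_feat, correct in datasets:
            y, x = forward(weights, bias, calib_feat, target_feat)
            err = y - correct                      # error against the correct answer
            total += err * err
            # gradient of err**2 with respect to each parameter, scaled by lr
            weights = [w - lr * 2.0 * err * v for w, v in zip(weights, x)]
            bias -= lr * 2.0 * err
        history.append(total)
    return weights, bias, history


# Synthetic example: one calibration feature, two target-image features.
calib_feat = [0.2]
datasets = [([a, b], 0.5 * a - 0.25 * b + 0.1)
            for a in (-1.0, 0.0, 1.0) for b in (-1.0, 0.0, 1.0)]
weights, bias, history = train(datasets, calib_feat)
```

In this sketch the summed error recorded in `history` shrinks from epoch to epoch, which corresponds to the adjustment "so that the sum of the errors becomes small"; a fixed epoch count stands in for the predetermined stopping condition.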
- Step S107: the control unit 11 operates as the storage processing unit 115 and generates information about the trained learning model 30 (estimation model 3, extractor 35, and coupler 36) generated by machine learning as learning result data 125. Then, the control unit 11 stores the generated learning result data 125 in a predetermined storage area.
- the predetermined storage area may be, for example, a RAM in the control unit 11, a storage unit 12, an external storage device, a storage medium, or a combination thereof.
- the storage medium may be, for example, a CD, a DVD, or the like, and the control unit 11 may store the learning result data 125 in the storage medium via the drive 17.
- the external storage device may be, for example, a data server such as NAS (Network Attached Storage). In this case, the control unit 11 may store the learning result data 125 in the data server via the network by using the communication interface 13. Further, the external storage device may be, for example, an external storage device connected to the model generation device 1 via the external interface 14.
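As a minimal sketch of step S107, the following writes parameter values to a file and reads them back, as a device receiving the learning result data might. The JSON format, the file name, and the dictionary layout are assumptions chosen for illustration; the description does not specify how the learning result data 125 is encoded.

```python
import json
import os
import tempfile


def save_learning_result(params, path):
    """Store the parameter values of the trained components in a file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(params, f)


def load_learning_result(path):
    """Read the stored parameter values back."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


# Hypothetical layout: one entry per trained component.
params = {
    "estimation_model_3": {"weights": [[0.1, -0.2], [0.3, 0.4]], "biases": [0.0, 0.1]},
    "extractor_35": {"weights": [[0.5, 0.5]], "biases": [0.0]},
    "coupler_36": {"weights": [[1.0, -1.0]], "biases": [0.2]},
}

with tempfile.TemporaryDirectory() as storage_area:
    path = os.path.join(storage_area, "learning_result_125.json")
    save_learning_result(params, path)
    restored = load_learning_result(path)
```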
- the control unit 11 ends the process related to this operation example.
- the generated learning result data 125 may be applied to the line-of-sight estimation device 2 at an arbitrary timing.
- the control unit 11 may transfer the learning result data 125 to the line-of-sight estimation device 2 as the process of step S107 or separately from the process of step S107.
- the line-of-sight estimation device 2 may acquire the learning result data 125 by receiving this transfer.
- the line-of-sight estimation device 2 may acquire the learning result data 125 by accessing the model generation device 1 or the data server via the network using the communication interface 23.
- the line-of-sight estimation device 2 may acquire the learning result data 125 via the storage medium 92.
- the learning result data 125 may be incorporated in the line-of-sight estimation device 2 in advance.
- the control unit 11 may update or newly create the learning result data 125 by repeating the processes of steps S101 to S107 (or steps S104 to S107) periodically or irregularly. At the time of this repetition, at least a part of the data used for machine learning may be changed, modified, added, or deleted as appropriate. Then, the control unit 11 may update the learning result data 125 held by the line-of-sight estimation device 2 by providing the updated or newly generated learning result data 125 to the line-of-sight estimation device 2 by an arbitrary method.
- FIG. 7 is a flowchart showing an example of the processing procedure of the line-of-sight estimation device 2 according to the present embodiment.
- the processing procedure described below is an example of the line-of-sight estimation method. However, it is only an example, and each step may be changed to the extent possible. Further, steps of the processing procedure described below can be omitted, replaced, or added as appropriate according to the embodiment.
- Step S201: the control unit 21 operates as the information acquisition unit 211 and acquires the calibration information 60 including the feature information 602 and the true value information 603.
- FIG. 8 schematically illustrates an example of a method of acquiring the calibration information 60.
- the control unit 21 outputs an instruction to the target person R to look in a predetermined direction.
- the output device 26 includes a display 261.
- the control unit 21 displays the marker M at a position corresponding to a predetermined direction on the display 261.
- the control unit 21 outputs an instruction to the target person R to look toward the marker M displayed on the display 261.
- the output format of the instruction may be appropriately selected according to the embodiment.
- when the output device 26 includes a speaker, the instruction may be output by voice through the speaker.
- the output of the instruction may be performed by displaying an image via the display device.
- the control unit 21 takes a picture of the face of the subject R looking toward the marker M with the camera S.
- the camera S is an example of a sensor capable of observing the line of sight of the subject R.
- as a result, the control unit 21 can acquire the reference image 601 in which the eyes of the target person R looking in the predetermined direction are captured. Further, the control unit 21 can acquire the true value information 603 from the instruction it output, since that instruction specifies the predetermined direction.
- the indicator of the predetermined direction is not limited to the marker M displayed on the display 261 and may be appropriately determined according to the embodiment.
- for example, when the positional relationship between an installed object such as a rearview mirror and the camera S is defined, the control unit 21 may output an instruction to look at that installed object.
- in this way, when there is an object whose positional relationship with the sensor is defined, the control unit 21 may output an instruction to the target person R to look at that object.
- the reference image 601 showing the individuality of the line of sight of the subject R and the corresponding true value information 603 can be appropriately and easily acquired.
- the predetermined directions may not completely match between the scene of model generation and the scene of estimating the line-of-sight direction (operation scene).
- to handle this, a plurality of different predetermined directions may be set, and in the operation scene, data for at least one of those predetermined directions (the reference image 601 and the true value information 603 in this embodiment) may be randomly selected.
- the control unit 21 refers to the learning result data 125 and sets up the trained extractor 35 and coupler 36.
- the control unit 21 inputs the acquired reference image 601 to the input layer of the learned extractor 35, and executes the arithmetic processing of the forward propagation of the extractor 35.
- the control unit 21 acquires the output value corresponding to the feature amount 6021 (feature information 602) related to the reference image 601 from the output layer of the learned extractor 35.
- the control unit 21 inputs the acquired feature amount 6021 and the true value information 603 to the input layer of the learned coupler 36, and executes the arithmetic processing of the forward propagation of the coupler 36.
- the control unit 21 acquires an output value corresponding to the feature amount 604 related to calibration from the output layer of the trained coupler 36.
- the control unit 21 can acquire the calibration information 60 composed of the feature amount 604 by using the learned extractor 35 and the coupler 36 by these arithmetic processes.
- the calibration information 60 may include the feature information 602 and the true value information 603 corresponding to each of a plurality of different predetermined directions in accordance with the generation process of the trained estimation model 3.
- the control unit 21 may acquire the reference image 601 and the true value information 603 for each of the plurality of different predetermined directions by executing the acquisition process (FIG. 8) for each predetermined direction. Then, by executing the arithmetic processing of the trained extractor 35 and coupler 36 on each of them, the control unit 21 may acquire the calibration information 60 (feature amount 604) that includes the feature information 602 and the true value information 603 for each of the plurality of different predetermined directions. When the calibration information 60 is acquired, the control unit 21 proceeds to the next step S202.
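The flow of step S201 for several predetermined directions can be sketched as follows. The functions `extractor_35` and `coupler_36` here are trivial placeholders (row means and concatenation), not the trained networks, and `fake_capture` is a hypothetical stand-in for instructing the target person R and photographing with the camera S. Only the order of operations is meant to reflect the description.

```python
def extractor_35(reference_image):
    """Placeholder extractor: the mean of each image row as a feature vector."""
    return [sum(row) / len(row) for row in reference_image]


def coupler_36(feature_6021, true_value_603):
    """Placeholder coupler: concatenates the feature amount and the true value."""
    return list(feature_6021) + list(true_value_603)


def acquire_calibration(directions, capture):
    """For each predetermined direction: obtain a reference image, extract
    its feature amount, and combine it with the true value of the direction."""
    calibration = []
    for direction in directions:
        reference_image = capture(direction)   # instruct, then photograph
        feature_6021 = extractor_35(reference_image)
        calibration.append(coupler_36(feature_6021, direction))
    return calibration


def fake_capture(direction):
    """Hypothetical camera: returns a tiny 2x2 'image' derived from the direction."""
    yaw, pitch = direction
    return [[yaw, yaw], [pitch, pitch]]


directions = [(0.0, 0.0), (10.0, -5.0)]   # assumed (yaw, pitch) pairs in degrees
calibration_60 = acquire_calibration(directions, fake_capture)
```

Because the true value is taken from the direction the person was instructed to look in, no separate measurement of the true value is needed in this sketch.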
- Step S202: the control unit 21 operates as the image acquisition unit 212 and acquires the target image 63 in which the eyes of the target person R are captured.
- the control unit 21 controls the operation of the camera S via the external interface 24 so as to photograph the target person R.
- the control unit 21 can directly acquire the target image 63, which is the target of the line-of-sight direction estimation process, from the camera S.
- the target image 63 may be a moving image or a still image.
- the route for acquiring the target image 63 does not have to be limited to such an example.
- the camera S may be controlled by another computer.
- the control unit 21 may indirectly acquire the target image 63 from the camera S via another computer.
- the control unit 21 proceeds to the next step S203.
- Step S203: the control unit 21 operates as the estimation unit 213 and uses the trained estimation model 3 to estimate the line-of-sight direction of the eyes of the target person R shown in the target image 63.
- the control unit 21 inputs the acquired target image 63 and the calibration information 60 into the trained estimation model 3 and executes the arithmetic processing of the trained estimation model 3.
- the control unit 21 acquires an output value corresponding to the result of estimating the eye-gaze direction of the target person R reflected in the target image 63 from the trained estimation model 3.
- the control unit 21 sets the trained extractor 31 and the estimator 32 with reference to the learning result data 125.
- the control unit 21 inputs the acquired target image 63 to the input layer of the learned extractor 31, and executes the arithmetic processing of the forward propagation of the extractor 31.
- the control unit 21 acquires an output value corresponding to the feature amount 64 related to the target image 63 from the output layer of the learned extractor 31.
- the control unit 21 inputs the feature amount 604 acquired in step S201 and the feature amount 64 acquired from the extractor 31 into the input layer of the trained estimator 32, and executes the forward propagation arithmetic processing of the estimator 32.
- through this arithmetic processing, the control unit 21 can acquire, from the output layer of the trained estimator 32, an output value corresponding to the result of estimating the line-of-sight direction of the eyes of the target person R shown in the target image 63. That is, in the present embodiment, estimating the line-of-sight direction of the target person R shown in the target image 63 is achieved by giving the target image 63 and the calibration information 60 to the trained estimation model 3 and executing the forward propagation arithmetic processing of the trained estimation model 3.
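The forward pass of step S203 can be sketched with a toy estimator made of two fully connected layers. The layer sizes, the ReLU activation, the parameter values, and the row-mean `extractor_31` placeholder are all assumptions for illustration; the description only states that the estimator 32 receives the feature amount 604 and the feature amount 64 and consists of one or more fully connected layers.

```python
def dense(inputs, weights, biases):
    """One fully connected layer: weights is a list of rows, one per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    """Assumed activation function between the layers."""
    return [max(0.0, v) for v in values]


def extractor_31(target_image):
    """Placeholder extractor: the mean of each image row as the feature amount 64."""
    return [sum(row) / len(row) for row in target_image]


def estimator_32(feature_604, feature_64, params):
    """Concatenate the calibration feature and the target-image feature,
    then propagate forward through two fully connected layers."""
    x = list(feature_604) + list(feature_64)
    hidden = relu(dense(x, params["w1"], params["b1"]))
    return dense(hidden, params["w2"], params["b2"])


# Hypothetical parameter values standing in for the learning result data.
params = {
    "w1": [[1.0, 0.0, 0.5, 0.0], [0.0, 1.0, 0.0, 0.5]],
    "b1": [0.0, 0.0],
    "w2": [[1.0, 0.0], [0.0, 1.0]],
    "b2": [0.0, 0.0],
}
target_image_63 = [[2.0, 4.0], [6.0, 8.0]]
feature_604 = [1.0, 2.0]                      # derived beforehand in step S201
gaze = estimator_32(feature_604, extractor_31(target_image_63), params)
```

With these values, `extractor_31` yields `[3.0, 7.0]` and the estimator returns `[2.5, 5.5]`, which in this sketch plays the role of the estimated line-of-sight direction.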
- the process of step S201 may be executed at an arbitrary timing before the arithmetic process of the estimator 32 is executed. For example, after executing the arithmetic processing of the learned extractor 31, the processing of step S201 may be executed.
- the control unit 21 proceeds to the next step S204.
- Step S204: the control unit 21 operates as the output unit 214 and outputs information regarding the result of estimating the line-of-sight direction of the target person R.
- the output destination and the content of the information to be output may be appropriately determined according to the embodiment.
- the control unit 21 may output the result of estimating the line-of-sight direction as is to, for example, a memory such as the RAM or the storage unit 22, or to the output device 26.
- the control unit 21 may create a history of the line-of-sight direction of the target person R by outputting the result of estimating the line-of-sight direction to the memory.
- the control unit 21 may execute some information processing by using the result of estimating the line-of-sight direction. Then, the control unit 21 may output the result of executing the information processing as information regarding the estimation result.
- the control unit 21 may determine whether or not the driver is looking away based on the estimated line-of-sight direction. Then, when it is determined that the driver is looking away, the control unit 21 may, as the output process of step S204, execute a process of instructing the driver to look in a direction appropriate for driving or a process of reducing the traveling speed of the vehicle.
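One possible way to make the looking-away determination is to compare the angle between the estimated line-of-sight vector and a forward direction against a threshold. The vector representation, the forward direction `(0, 0, 1)`, and the 20-degree threshold are assumptions for illustration; the description does not fix a criterion.

```python
import math


def angle_between_deg(a, b):
    """Angle in degrees between two non-zero 3D vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Clamp to [-1, 1] to guard against rounding just outside acos's domain.
    cos_angle = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.degrees(math.acos(cos_angle))


def is_looking_away(gaze, forward=(0.0, 0.0, 1.0), threshold_deg=20.0):
    """True when the gaze deviates from the forward direction by more than
    the (assumed) threshold angle."""
    return angle_between_deg(gaze, forward) > threshold_deg


straight = is_looking_away((0.0, 0.0, 1.0))   # gaze matches the forward direction
sideways = is_looking_away((1.0, 0.0, 0.0))   # gaze is 90 degrees off
```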
- the control unit 21 may, as the output process of step S204, execute an application corresponding to an icon located in the estimated line-of-sight direction, or execute a process of changing the display range so that a display object located in the estimated line-of-sight direction comes to the center of the display device.
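Selecting the icon located in the estimated line-of-sight direction can be sketched as a rectangle hit test. This assumes the line-of-sight direction has already been converted into a point in display coordinates, which the description does not detail; the icon names and rectangles are hypothetical.

```python
def icon_at(gaze_point, icons):
    """Return the name of the icon whose rectangle contains the gaze point,
    or None when the gaze point falls on no icon.

    icons maps a name to (left, top, width, height) in display pixels.
    """
    x, y = gaze_point
    for name, (left, top, width, height) in icons.items():
        if left <= x < left + width and top <= y < top + height:
            return name
    return None


icons = {
    "mail": (10, 10, 64, 64),
    "browser": (100, 10, 64, 64),
}
selected = icon_at((15, 15), icons)     # inside the "mail" rectangle
missed = icon_at((500, 500), icons)     # outside every rectangle
```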
- the control unit 21 proceeds to the next step S205.
- Step S205: it is determined whether or not to repeat the line-of-sight direction estimation process.
- the criteria for determining whether or not to repeat the estimation process may be appropriately determined according to the embodiment.
- a period or number of times for repeating the process may be set.
- the control unit 21 may determine whether or not to repeat the line-of-sight direction estimation process depending on whether or not the period or number of times of executing the process has reached a specified value. That is, if the period or number of times the estimation process has been executed has not reached the specified value, the control unit 21 may determine that the line-of-sight direction estimation process is to be repeated. On the other hand, when the period or number of times the estimation process has been executed reaches the specified value, the control unit 21 may determine that the line-of-sight direction estimation process is not to be repeated.
- the control unit 21 may repeat the line-of-sight direction estimation process until an end instruction is given via the input device 25. In this case, the control unit 21 may determine that the line-of-sight direction estimation process is to be repeated while no end instruction has been given. On the other hand, after the end instruction is given, the control unit 21 may determine that the line-of-sight direction estimation process is not to be repeated.
- when it is determined that the line-of-sight direction estimation process is to be repeated, the control unit 21 returns the process to step S202 and repeatedly executes the acquisition process of the target image 63 (step S202) and the line-of-sight direction estimation process of the target person R (step S203). As a result, the line-of-sight direction of the target person R can be continuously estimated. On the other hand, when it is determined that the line-of-sight direction estimation process is not to be repeated, the control unit 21 stops the repeated execution of the line-of-sight direction estimation process and ends the processing procedure according to this operation example.
- if the calibration information 60 (feature amount 604) has already been derived in step S201, it can be reused in each cycle of executing the line-of-sight direction estimation process unless the calibration information 60 is updated. Therefore, as in the present embodiment, the process of step S201 may be omitted in each cycle of executing the line-of-sight direction estimation process. However, it is not always necessary to omit the process of step S201 in all the cycles. When updating the calibration information 60, step S201 may be executed again at an arbitrary timing. Further, the process of step S204 may be omitted in at least a part of the cycles.
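The repetition of steps S202 to S205 with calibration information derived once can be sketched as a loop. The callables passed in are hypothetical stand-ins for the respective steps, and the two stopping criteria (a cycle count and an end instruction) mirror the examples given for step S205.

```python
def run_estimation(acquire_calibration, acquire_image, estimate, output,
                   max_cycles=None, end_requested=lambda: False):
    """Run steps S202 to S205 repeatedly, reusing the calibration from S201.

    At least one stopping criterion should be supplied; with neither
    max_cycles nor end_requested the loop would never terminate.
    """
    calibration = acquire_calibration()          # step S201, executed once
    cycles = 0
    while True:
        image = acquire_image()                  # step S202
        result = estimate(image, calibration)    # step S203
        output(result)                           # step S204
        cycles += 1
        # step S205: decide whether to repeat
        if max_cycles is not None and cycles >= max_cycles:
            break
        if end_requested():
            break
    return cycles


calls = {"calibration": 0}


def fake_calibration():
    """Hypothetical calibration step that counts how often it is invoked."""
    calls["calibration"] += 1
    return [0.5]


outputs = []
cycles = run_estimation(
    fake_calibration,
    lambda: [1.0],                               # hypothetical image source
    lambda image, calibration: image[0] + calibration[0],
    outputs.append,
    max_cycles=3,
)
```

Here the calibration step runs once while the estimate is produced three times, which is the cost saving that reusing the feature amount 604 is meant to provide.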
- in step S203, in order to estimate the line-of-sight direction of the target person R, not only the target image 63 in which the eyes of the target person R are captured but also the calibration information 60 including the feature information 602 and the true value information 603 is used. From the feature information 602 and the true value information 603, the individuality of the line of sight of the target person R can be grasped for a direction whose true value is known. Therefore, according to the present embodiment, the line-of-sight direction of the target person R shown in the target image 63 can be estimated in consideration of the individual differences between the subjects and the target person R that can be grasped from the calibration information 60.
- the calibration information 60 may include the feature information 602 and the true value information 603 corresponding to each of a plurality of different predetermined directions. Thereby, the individuality of the line of sight of the subject R can be more accurately grasped from the calibration information 60 for a plurality of different predetermined directions. Therefore, it is possible to further improve the accuracy of estimating the line-of-sight direction of the subject R.
- the trained estimation model 3 capable of estimating the line-of-sight direction of the subject R with such high accuracy can be generated by the processes of steps S101 to S107.
- the reference image 601 and the true value information 603 are not used as is as the calibration information 60. Instead, the feature amount 6021 is extracted from the reference image 601, and the feature amount 604 derived by combining the resulting feature information 602 with the true value information 603 is used as the calibration information 60. This reduces the amount of information in the calibration information 60.
- the derivation of the feature amount 604 is executed within the process of step S201. When the process of estimating the line-of-sight direction of the subject R is repeated, the derived feature amount 604 can be reused in each cycle. As a result, the processing cost of step S203 can be suppressed. Therefore, according to the present embodiment, it is possible to speed up the process of estimating the line-of-sight direction of the target person R in step S203.
- the trained extractor 35 can appropriately extract the feature amount 6021 (feature information 602) including the component related to the gaze feature of the subject R who looks in a predetermined direction from the reference image 601.
- with the trained coupler 36, a feature amount 604 that includes a component aggregating the features of the line of sight of the target person R looking in the predetermined direction and the true value of that predetermined direction can be appropriately derived from the feature amount 6021 and the true value information 603. Therefore, in the trained estimation model 3, the line-of-sight direction of the target person R can be appropriately estimated from the feature amount 604 and the target image 63.
- the camera S is used to acquire the calibration information 60.
- the sensor for observing the line of sight of the subject R does not have to be limited to such an example.
- the type of the sensor is not particularly limited as long as it can observe the characteristics of the line of sight of the subject R, and may be appropriately selected according to the embodiment.
- as the sensor, for example, a scleral contact lens containing a coil, an electro-oculography sensor, or the like may be used.
- the line-of-sight estimation device 2 may observe the line-of-sight of the target person R with a sensor after outputting an instruction to the target person R to look in a predetermined direction, as in the above embodiment.
- Feature information 602 can be acquired from the sensing data obtained by this observation.
- a search coil method, an EOG (electro-oculogram) method, or the like may be used to acquire the feature information 602.
- the estimation model 3 is composed of an extractor 31 and an estimator 32.
- the calibration information 60 is composed of the feature amount 604 derived from the reference image 601 and the true value information 603 by using the extractor 35 and the coupler 36.
- the estimator 32 is configured to receive inputs of the feature amount 604 derived by the coupler 36 and the feature amount 64 related to the target image 63.
- the configuration of the estimation model 3 and the calibration information 60 does not have to be limited to such an example.
- the estimation model 3 may further include a coupler 36.
- the calibration information 60 may be composed of the feature information 602 and the true value information 603.
- in this case, the process of step S201 may consist of acquiring the reference image 601, acquiring the feature amount 6021 (feature information 602) related to the reference image 601 by inputting the reference image 601 into the extractor 35 and executing the arithmetic processing of the extractor 35, and acquiring the true value information 603.
- the process of step S203 may further include a process of deriving the feature amount 604 from the feature amount 6021 and the true value information 603 by using the coupler 36.
- the estimation model 3 may further include an extractor 35 and a coupler 36.
- the feature information 602 may be composed of the reference image 601.
- the process of step S201 may be configured by acquiring the reference image 601 and the true value information 603.
- the calibration information 60 may be composed of the reference image 601 and the true value information 603.
- the process of step S203 may further include a process of deriving the feature amount 604 from the reference image 601 and the true value information 603 by using the extractor 35 and the coupler 36.
- the extractor 35 may be omitted.
- the control unit 21 may directly acquire the feature information 602.
- the process of extracting the feature amount 6021 from the reference image 601 may be executed by another computer.
- the control unit 21 may acquire the feature amount 6021 from another computer.
- the feature information 602 may be composed of the reference image 601.
- the coupler 36 may be configured to accept the input of the reference image 601 and the true value information 603.
- FIG. 9 schematically illustrates an example of the software configuration of the model generation device 1 that generates the estimation model 3A according to the first modification.
- FIG. 10 schematically illustrates an example of the software configuration of the line-of-sight estimation device 2 using the estimation model 3A according to the first modification.
- the coupler 36 is omitted.
- the process of deriving the calibration feature amount from the feature information and the true value information is omitted.
- the estimator 32A is configured to accept input of feature information, true value information, and feature quantities related to the target image. That is, the feature information and the true value information are directly input to the estimator 32A, not the calibration feature amount. Except for these points, the first modification is configured in the same manner as in the above embodiment.
- the estimator 32A comprises one or more fully connected layers 321A as in the above embodiment.
- the estimation model 3A is composed of an extractor 31 and an estimator 32A.
- the model generation device 1 can generate the trained estimation model 3A (extractor 31 and estimator 32A) and the trained extractor 35 by the same processing procedure as in the above embodiment, except that the training process of the coupler 36 is omitted.
- the control unit 11 generates information about the trained estimation model 3A and the extractor 35 generated by machine learning as learning result data 125A. Then, the control unit 11 stores the generated learning result data 125A in a predetermined storage area.
- the learning result data 125A may be provided to the line-of-sight estimation device 2 at an arbitrary timing.
- the line-of-sight estimation device 2 can estimate the line-of-sight direction of the target person R by the same processing procedure as in the above embodiment, except that the arithmetic processing of the coupler 36 is omitted.
- the control unit 21 acquires the reference image 601 and the true value information 603.
- the control unit 21 inputs the acquired reference image 601 to the extractor 35, and executes the arithmetic processing of the extractor 35.
- the control unit 21 acquires an output value corresponding to the feature amount 6021 (feature information 602) related to the reference image 601 from the extractor 35.
- the calibration information 60 is composed of the feature amount 6021 (feature information 602) and the true value information 603.
- in step S203, the control unit 21 estimates the line-of-sight direction of the target person R shown in the target image 63 by using the trained estimation model 3A. Specifically, the control unit 21 inputs the acquired target image 63 to the extractor 31 and executes the arithmetic processing of the extractor 31. By this arithmetic processing, the control unit 21 acquires the feature amount 64 related to the target image 63 from the extractor 31. Next, the control unit 21 inputs the feature amount 6021 (feature information 602), the true value information 603, and the feature amount 64 into the estimator 32A, and executes the arithmetic processing of the estimator 32A. By this arithmetic processing, the control unit 21 can acquire an output value corresponding to the result of estimating the line-of-sight direction of the target person R shown in the target image 63 from the estimator 32A.
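The difference from the embodiment is only in what the estimator receives: the first modification feeds the feature information, the true value information, and the target-image feature directly, with no coupler in between. A minimal sketch under assumed dimensions and hypothetical parameter values, using a single fully connected layer for the estimator 32A:

```python
def dense(inputs, weights, biases):
    """One fully connected layer: weights is a list of rows, one per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def estimator_32a(feature_6021, true_value_603, feature_64, weights, biases):
    """Concatenate the three inputs directly (no coupler) and apply the layer."""
    x = list(feature_6021) + list(true_value_603) + list(feature_64)
    return dense(x, weights, biases)


# Hypothetical values: 2 reference features, 1 true value, 1 target feature.
weights = [[1.0, 0.0, 0.0, 1.0],
           [0.0, 1.0, 2.0, 0.0]]
biases = [0.0, 0.0]
gaze_a = estimator_32a([1.0, 2.0], [0.5], [3.0], weights, biases)
```

The input layer of the estimator therefore has to be as wide as the three inputs combined, whereas in the embodiment its width depends on the size of the feature amount 604 produced by the coupler.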
- in the first modification, the line-of-sight direction of the target person R can be appropriately estimated from the feature information 602 (feature amount 6021), the true value information 603, and the target image 63, as in the above embodiment.
- by using the feature information 602 and the true value information 603, it is possible to improve the accuracy of estimating the line-of-sight direction of the target person R in step S203.
- the feature amount 6021 (feature information 602) derived in step S201 can be reused in each cycle. This correspondingly speeds up the process of estimating the line-of-sight direction of the target person R in step S203.
- the extractor 35 may be omitted in the line-of-sight estimation device 2 also in this first modification.
- the control unit 21 may directly acquire the feature information 602 in the same manner as described above.
- the estimator 32A may be configured to accept inputs of the reference image 601, the true value information 603, and the feature amount 64.
- FIG. 11 schematically illustrates an example of the software configuration of the model generation device 1 that generates the estimation model 3B according to the second modification.
- FIG. 12 schematically illustrates an example of the software configuration of the line-of-sight estimation device 2 using the estimation model 3B according to the second modification.
- the estimation model 3B further includes an extractor 35. That is, the estimation model 3B includes each extractor (31, 35) and an estimator 32B.
- the feature information is composed of a reference image.
- except for these points, the second modification is configured in the same manner as the first modification.
- the estimator 32B is configured in the same manner as the estimator 32A.
- the estimator 32B includes one or more fully connected layers 321B as in the first modification.
- the model generation device 1 can generate the trained estimation model 3B by the same processing procedure as in the first modification.
- the control unit 11 generates information about the learned estimation model 3B generated by machine learning as learning result data 125B. Then, the control unit 11 stores the generated learning result data 125B in a predetermined storage area.
- the learning result data 125B may be provided to the line-of-sight estimation device 2 at an arbitrary timing.
- the line-of-sight estimation device 2 can estimate the line-of-sight direction of the target person R by the same processing procedure as in the first modification.
- the control unit 21 acquires the reference image 601 and the true value information 603.
- the control unit 21 estimates the line-of-sight direction of the target person R reflected in the target image 63 by using the learned estimation model 3B. Specifically, the control unit 21 inputs the acquired target image 63 to the extractor 31 and executes arithmetic processing of the extractor 31. By this arithmetic processing, the control unit 21 acquires the feature amount 64 related to the target image 63 from the extractor 31.
- the control unit 21 inputs the acquired reference image 601 to the extractor 35 and executes the arithmetic processing of the extractor 35. As a result, the control unit 21 acquires an output value corresponding to the feature amount 6021 related to the reference image 601 from the extractor 35.
- the processing order of each extractor (31, 35) may be arbitrary.
- the control unit 21 inputs the feature amount 6021, the true value information 603, and the feature amount 64 into the estimator 32B, and executes the arithmetic processing of the estimator 32B. By this arithmetic processing, the control unit 21 can acquire an output value corresponding to the result of estimating the line-of-sight direction of the target person R reflected in the target image 63 from the estimator 32B.
- the line-of-sight direction of the target person R can be appropriately estimated from the reference image 601 (feature information), the true value information 603, and the target image 63, as in the above embodiment. By using the feature information and the true value information 603, it is possible to improve the accuracy of estimating the line-of-sight direction of the target person R in step S203.
- FIGS. 13A and 13B schematically show an example of the software configuration of the model generation device 1 that generates the estimation model 3C according to the third modification.
- FIG. 14 schematically illustrates an example of the software configuration of the line-of-sight estimation device 2 using the estimation model 3C according to the third modification.
- a heat map is used to express the line-of-sight direction in the feature amount and the like.
- the heat map represents, as an image, the direction in which a person gazes.
- the value of each pixel in the heat map corresponds to, for example, the degree to which a person gazes at that position. When the pixel values are normalized so that they sum to 1, the value of each pixel can indicate the probability that the person is gazing at that position.
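The normalization mentioned above can be illustrated with a minimal sketch. It assumes a heat map stored as a nested list of non-negative floats; the function name `normalize_heat_map` and the sample values are hypothetical and do not come from the patent.

```python
from typing import List

HeatMap = List[List[float]]


def normalize_heat_map(heat_map: HeatMap) -> HeatMap:
    """Scale non-negative pixel values so that they sum to 1."""
    total = sum(sum(row) for row in heat_map)
    if total <= 0:
        raise ValueError("heat map must contain positive mass")
    return [[value / total for value in row] for row in heat_map]


# Hypothetical 3x3 heat map: larger values mean stronger gaze at that pixel.
raw = [[0.0, 1.0, 0.0],
       [1.0, 4.0, 1.0],
       [0.0, 1.0, 0.0]]
prob = normalize_heat_map(raw)  # each entry can now be read as a probability
```

After normalization, the center pixel holds half of the total mass, which can be read as a 50% probability that the person is gazing at that position.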
- each extractor (31, 35, 41) is replaced by each converter (31C, 35C, 41C).
- the converter 31C is an example of the first converter.
- the converter 35C is an example of the second converter.
- Each converter (31C, 35C, 41C) is configured to accept an input of an image in which a person's eyes appear and output a heat map relating to the line-of-sight direction of the person derived from the input image. That is, each converter (31C, 35C, 41C) is configured to convert an image of a person's eyes into a heat map with respect to the line-of-sight direction.
- the converter 41C includes a convolution layer 415, a pooling layer 416, an unpooling layer 417, and a deconvolution layer 418.
- the unpooling layer 417 is configured to perform the inverse operation of the pooling process of the pooling layer 416.
- the deconvolution layer 418 is configured to perform the inverse operation of the convolution operation of the convolution layer 415.
- each layer 415 to 418 may be appropriately determined according to the embodiment.
- the unpooling layer 417 and the deconvolution layer 418 are arranged on the output side of the convolution layer 415 and the pooling layer 416.
- the convolution layer 415 arranged closest to the input side constitutes the input layer.
- the deconvolution layer 418 arranged closest to the output side constitutes the output layer.
- the structure of the converter 41C is not limited to such an example, and may be appropriately determined according to the embodiment.
- the converter 41C may include other types of layers such as a normalization layer and a dropout layer.
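The relationship between the pooling layer 416 and the unpooling layer 417 can be sketched as follows. This is only one common way to realize an "inverse" of pooling (max unpooling that restores each maximum to its recorded position); the patent does not state which variant is used, and the function names, the 2x2 window, and the even-sized input are assumptions made for illustration.

```python
from typing import List, Tuple

Grid = List[List[float]]
Index = Tuple[int, int]


def max_pool_2x2(image: Grid) -> Tuple[Grid, List[List[Index]]]:
    """2x2 max pooling that also records where each maximum came from.

    Assumes the image height and width are both even.
    """
    pooled: Grid = []
    positions: List[List[Index]] = []
    for top in range(0, len(image), 2):
        pooled_row: List[float] = []
        position_row: List[Index] = []
        for left in range(0, len(image[0]), 2):
            window = [(image[r][c], (r, c))
                      for r in (top, top + 1) for c in (left, left + 1)]
            value, position = max(window, key=lambda item: item[0])
            pooled_row.append(value)
            position_row.append(position)
        pooled.append(pooled_row)
        positions.append(position_row)
    return pooled, positions


def max_unpool_2x2(pooled: Grid, positions: List[List[Index]],
                   height: int, width: int) -> Grid:
    """Put each pooled value back at its recorded position; zeros elsewhere."""
    output = [[0.0] * width for _ in range(height)]
    for pooled_row, position_row in zip(pooled, positions):
        for value, (r, c) in zip(pooled_row, position_row):
            output[r][c] = value
    return output


image = [[1.0, 2.0, 0.0, 0.0],
         [3.0, 4.0, 0.0, 5.0],
         [0.0, 0.0, 6.0, 0.0],
         [7.0, 0.0, 0.0, 0.0]]
pooled, positions = max_pool_2x2(image)
restored = max_unpool_2x2(pooled, positions, height=4, width=4)
```

The restored grid has the original resolution but keeps only the maxima, which is why a deconvolution layer typically follows to fill in a dense output.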
- each converter (31C, 35C) is derived from the converter 41C.
- a common converter may be used for each converter (31C, 35C), or a separate converter may be used.
- the estimation model 3C includes a converter 31C and an estimator 32C.
- the feature information is composed of a heat map relating to the line-of-sight direction of the eyes looking in a predetermined direction, which is derived from a reference image in which the eyes of a person (target person R) looking in the predetermined direction are captured.
- the estimator 32C is configured to receive input of the heat map derived from the target image, the feature information, and the true value information, and to output an output value corresponding to the result of estimating the line-of-sight direction of the person appearing in the target image.
- the feature information is composed of a heat map derived from the reference image and relating to the line-of-sight direction of the eye looking in a predetermined direction.
- the true value information is converted into a heat map regarding the true value in a predetermined direction.
- accepting the input of the heat map derived from the target image, the feature information, and the true value information consists of accepting the heat map derived from the target image, the heat map derived from the reference image (feature information), and the heat map obtained from the true value information.
- the estimator 32C includes a connection layer 325, a convolution layer 326, and a conversion layer 327 in this order from the input side.
- the connection layer 325 is configured to concatenate the input heat maps.
- the conversion layer 327 is configured to convert the output obtained from the convolution layer 326 into an estimation result of the line-of-sight direction.
- the connection layer 325 and the conversion layer 327 may be appropriately composed of a plurality of neurons (nodes).
- the structure of the estimator 32C does not have to be limited to such an example, and may be appropriately determined according to the embodiment.
- the estimator 32C may include other types of layers such as a pooling layer and a fully connected layer.
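As a rough illustration of how the front end of the estimator 32C could combine its three inputs, the sketch below stacks same-sized heat maps along a channel axis (the role of the connection layer 325) and mixes them with a 1x1 convolution (a stand-in for the convolution layer 326). The kernel size, the weights, and the function names are assumptions for illustration; the patent does not specify them, and a trained network would learn its weights.

```python
from typing import List, Sequence

HeatMap = List[List[float]]


def concatenate(heat_maps: Sequence[HeatMap]) -> List[HeatMap]:
    """Stack same-sized heat maps along a new channel axis."""
    height, width = len(heat_maps[0]), len(heat_maps[0][0])
    for heat_map in heat_maps:
        if len(heat_map) != height or any(len(row) != width for row in heat_map):
            raise ValueError("all heat maps must share one size")
    return [[list(row) for row in heat_map] for heat_map in heat_maps]


def conv_1x1(stacked: List[HeatMap], weights: Sequence[float]) -> HeatMap:
    """Mix channels pixel by pixel with one weight per channel."""
    if len(weights) != len(stacked):
        raise ValueError("one weight per channel is required")
    height, width = len(stacked[0]), len(stacked[0][0])
    return [[sum(w * channel[r][c] for w, channel in zip(weights, stacked))
             for c in range(width)]
            for r in range(height)]


# Hypothetical 2x2 heat maps for the target image, the reference image,
# and the true value information.
target_map = [[0.0, 1.0], [0.0, 0.0]]
reference_map = [[1.0, 0.0], [0.0, 0.0]]
truth_map = [[0.0, 0.0], [1.0, 0.0]]

stacked = concatenate([target_map, reference_map, truth_map])
mixed = conv_1x1(stacked, weights=[1.0, 0.5, -0.5])  # illustrative weights
```

Because all three inputs share the heat map format, they can be stacked without any per-input adapter, which is the simplification the text attributes to a common data format.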
- the model generation device 1 generates the trained estimation model 3C by the same processing procedure as that of the above embodiment. Further, the line-of-sight estimation device 2 estimates the line-of-sight direction of the target person R by using the learned estimation model 3C in the same processing procedure as in the above embodiment.
- in step S102, the control unit 11 uses the plurality of data sets 120 to perform machine learning of the converter 41C.
- the control unit 11 inputs the learning image 121 of each data set 120 to the converter 41C, and executes the arithmetic processing of the converter 41C.
- the control unit 11 acquires the output value corresponding to the heat map converted from the learning image 121 from the converter 41C.
- the control unit 11 converts the corresponding correct answer information 123 into the heat map 129.
- the method of converting the correct answer information 123 into the heat map 129 may be appropriately selected depending on the embodiment. For example, the control unit 11 prepares an image having the same size as the heat map output by the converter 41C. Subsequently, the control unit 11 arranges a predetermined distribution such as a Gaussian distribution centered on the position corresponding to the true value in the line-of-sight direction indicated by the correct answer information 123 in the prepared image. The maximum value of the distribution may be determined as appropriate. As a result, the correct answer information 123 can be converted into the heat map 129.
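The conversion just described can be sketched as follows: prepare an image of the required size and place a Gaussian distribution centered on the pixel that corresponds to the true value. The mapping from a gaze direction to a pixel position is assumed to be given, and `center_row`, `center_col`, `sigma`, and `peak` are hypothetical parameters chosen for illustration.

```python
import math
from typing import List

HeatMap = List[List[float]]


def gaussian_heat_map(height: int, width: int,
                      center_row: float, center_col: float,
                      sigma: float = 1.5, peak: float = 1.0) -> HeatMap:
    """Return a height x width map with a Gaussian bump at the given center."""
    return [[peak * math.exp(-((r - center_row) ** 2 + (c - center_col) ** 2)
                             / (2.0 * sigma ** 2))
             for c in range(width)]
            for r in range(height)]


# Hypothetical example: the true gaze direction maps to pixel (row 2, col 4)
# of a 5 x 7 heat map.
heat_map_129 = gaussian_heat_map(5, 7, center_row=2, center_col=4)
```

The `peak` argument corresponds to the maximum value of the distribution, which the text leaves to be determined as appropriate.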
- the control unit 11 calculates an error between the output value obtained from the converter 41C and the heat map 129. Subsequent machine learning processing may be the same as in the above embodiment.
- using the error backpropagation method, the control unit 11 calculates, in order from the output side, the error in the value of each calculation parameter of the converter 41C from the gradient of the calculated output error, and updates the value of each calculation parameter based on the calculated error.
- the control unit 11 adjusts the value of each calculation parameter of the converter 41C so that the sum of the errors of the calculated output values becomes small for each data set 120 by the above series of update processes.
- the control unit 11 may repeatedly adjust the value of each calculation parameter of the converter 41C until a predetermined condition is satisfied.
- the control unit 11 can generate a trained converter 41C that has acquired, for each data set 120, the ability to appropriately convert an image of a person's eyes into a heat map relating to the line-of-sight direction of that person.
- in step S103, the control unit 11 reuses the converter 41C as each converter (31C, 35C). As a result, the control unit 11 prepares a learning model composed of the estimation model 3C and the converter 35C.
- in step S104, the control unit 11 acquires the learning feature information 502C by using the converter 35C. Specifically, the control unit 11 acquires the learning reference image 501 and the learning true value information 503 as in the above embodiment. Next, the control unit 11 inputs the acquired learning reference image 501 to the converter 35C, and executes the arithmetic processing of the converter 35C. By this arithmetic processing, the control unit 11 acquires an output value corresponding to the learning heat map 5021C regarding the line-of-sight direction of the eye looking in a predetermined direction derived from the learning reference image 501 from the converter 35C. In this modification, the learning feature information 502C is composed of the heat map 5021C.
- the control unit 11 converts the learning true value information 503 into a heat map 5031.
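The update cycle of step S102 (compute the error against the heat map 129, follow its gradient, and repeat until a predetermined condition holds) can be illustrated with a deliberately tiny stand-in. Instead of a real converter 41C, the sketch assumes a "model" with a single gain parameter that scales each input pixel; the learning rate, tolerance, and step limit are all hypothetical.

```python
from typing import List

HeatMap = List[List[float]]


def train_gain(inputs: List[HeatMap], targets: List[HeatMap],
               learning_rate: float = 0.05, tolerance: float = 1e-10,
               max_steps: int = 1000) -> float:
    """Fit output = gain * input by gradient descent on the summed squared error."""
    gain = 0.0
    for _ in range(max_steps):
        total_error = 0.0
        gradient = 0.0
        for input_map, target_map in zip(inputs, targets):
            for input_row, target_row in zip(input_map, target_map):
                for x, t in zip(input_row, target_row):
                    diff = gain * x - t
                    total_error += diff * diff   # error summed over data sets
                    gradient += 2.0 * diff * x   # d(error)/d(gain)
        if total_error < tolerance:              # predetermined stop condition
            break
        gain -= learning_rate * gradient         # parameter update
    return gain


# Toy data: the target heat map is exactly three times the input.
toy_inputs = [[[0.0, 1.0], [1.0, 2.0]]]
toy_targets = [[[0.0, 3.0], [3.0, 6.0]]]
learned_gain = train_gain(toy_inputs, toy_targets)
```

A real converter has many parameters and uses backpropagation to obtain the gradient for each of them, but the loop structure (forward computation, error, gradient, update, stop condition) is the same.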
- the same method as the method for converting the correct answer information 123 into the heat map 129 may be used.
- the control unit 11 can acquire calibration information for learning composed of two heat maps (5021C and 5031). Similar to the above embodiment, the control unit 11 may acquire the learning reference image 501 and the learning true value information 503 for each of the plurality of different predetermined directions. Then, the control unit 11 may acquire heat maps (5021C, 5031) for each of a plurality of different predetermined directions by executing each arithmetic processing. In step S105, the control unit 11 acquires a plurality of learning data sets 51 as in the embodiment.
- in step S106, the control unit 11 performs machine learning of the estimation model 3C using the acquired plurality of learning data sets 51.
- the control unit 11 inputs the learning target image 53 of each learning data set 51 to the converter 31C, and executes the arithmetic processing of the converter 31C.
- the control unit 11 acquires the output value corresponding to the heat map 54C converted from the learning target image 53 from the converter 31C.
- the control unit 11 inputs each heat map (5021C, 5031, 54C) to the estimator 32C, and executes the arithmetic processing of the estimator 32C.
- the control unit 11 acquires an output value corresponding to the result of estimating the line-of-sight direction of the subject reflected in the learning target image 53 from the estimator 32C.
- the control unit 11 calculates the error between the output value obtained from the estimator 32C and the corresponding correct answer information 55. Subsequent machine learning processing may be the same as in the above embodiment.
- using the error backpropagation method, the control unit 11 calculates, in order from the output side, the error in the value of each calculation parameter of the learning model from the gradient of the calculated output error, and updates the value of each calculation parameter based on the calculated error.
- by executing the above series of update processes together with the arithmetic processing of the converter 35C and the arithmetic processing of the estimation model 3C, the control unit 11 adjusts the value of each calculation parameter of the learning model so that, for the learning reference image 501, the learning true value information 503, and each learning data set 51, the sum of the errors of the calculated output values becomes small.
- the control unit 11 may repeat the adjustment of the value of each calculation parameter of the learning model until a predetermined condition is satisfied.
- the control unit 11 can generate a trained learning model that has acquired, for each learning data set 51, the ability to appropriately estimate the line-of-sight direction of a person from the learning reference image 501, the learning true value information 503, and the learning target image 53.
- the learning reference image 501, the learning true value information 503, and the plurality of learning data sets 51 used for machine learning of the learning model are obtained from the same subject.
- to this end, the subject from which each item originates may be identified.
- the heat map 5031 obtained from the learning true value information 503 may be reused, so that the process of converting the learning true value information 503 into the heat map 5031 may be omitted.
- the learning true value information 503 may be converted into the heat map 5031 in advance. Further, each converter (31C, 35C) is trained to acquire the ability to convert an image of a person's eyes into a heat map relating to the line-of-sight direction of the eyes by machine learning of the converter 41C.
- the process of adjusting the value of each calculation parameter of each converter (31C, 35C) may be omitted.
- the calculation result of each converter (31C, 35C) may be reused while the adjustment of the value of each calculation parameter is repeated. That is, the operations for deriving each heat map (5021C, 5031) do not have to be repeatedly executed.
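The reuse described above amounts to caching: as long as a converter's parameters are frozen, its output for a given image does not change between training iterations. A minimal sketch under that assumption follows; `CachedConverter`, the string keys, and the doubling function used as a stand-in converter are hypothetical.

```python
from typing import Callable, Dict, Hashable, List

HeatMap = List[List[float]]


class CachedConverter:
    """Wrap a frozen converter so that each image is converted only once."""

    def __init__(self, convert: Callable[[HeatMap], HeatMap]) -> None:
        self._convert = convert
        self._cache: Dict[Hashable, HeatMap] = {}
        self.calls = 0  # number of real conversions performed

    def __call__(self, key: Hashable, image: HeatMap) -> HeatMap:
        if key not in self._cache:
            self.calls += 1
            self._cache[key] = self._convert(image)
        return self._cache[key]


# Stand-in for a trained converter: doubles every pixel value.
cached = CachedConverter(lambda image: [[2.0 * v for v in row] for row in image])

first = cached("reference_501", [[1.0, 2.0]])
for _ in range(3):  # later training iterations reuse the stored result
    again = cached("reference_501", [[1.0, 2.0]])
```

The cache is only valid while the converter's parameters are not being adjusted; if they are updated, the stored results must be discarded.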
- in step S107, the control unit 11 generates information about the trained estimation model 3C and the converter 35C generated by machine learning as the learning result data 125C.
- the control unit 11 stores the generated learning result data 125C in a predetermined storage area.
- the learning result data 125C may be provided to the line-of-sight estimation device 2 at an arbitrary timing.
- the trained estimation model 3C includes a trained converter 31C and an estimator 32C.
- in step S201, the control unit 21 acquires the reference image 601 and the true value information 603.
- the control unit 21 inputs the acquired reference image 601 to the trained converter 35C, and executes arithmetic processing of the converter 35C.
- the control unit 21 acquires the output value corresponding to the heat map 6021C regarding the line-of-sight direction of the eye looking in the predetermined direction derived from the reference image 601 from the trained converter 35C.
- the heat map 6021C is an example of the second heat map.
- the feature information 602C is configured by this heat map 6021C.
- the control unit 21 converts the true value information 603 into a heat map 6031 regarding the true value in a predetermined direction.
- the heat map 6031 is an example of the third heat map.
- the control unit 21 can acquire calibration information composed of each heat map (6021C, 6031).
- the control unit 21 may acquire the reference image 601 and the true value information 603 for each of the plurality of different predetermined directions. Then, the control unit 21 may acquire heat maps (6021C, 6031) for each of a plurality of different predetermined directions by executing each arithmetic processing.
- in step S203, the control unit 21 estimates the line-of-sight direction of the target person R reflected in the target image 63 by using the learned estimation model 3C. Specifically, the control unit 21 inputs the acquired target image 63 to the trained converter 31C, and executes arithmetic processing of the converter 31C. By this arithmetic processing, the control unit 21 acquires the output value corresponding to the heat map 64C regarding the line-of-sight direction of the target person R derived from the target image 63 from the trained converter 31C.
- the heat map 64C is an example of the first heat map.
- the control unit 21 inputs each heat map (6021C, 6031, 64C) to the trained estimator 32C, and executes the arithmetic processing of the estimator 32C. By this arithmetic processing, the control unit 21 can acquire an output value corresponding to the result of estimating the line-of-sight direction of the target person R reflected in the target image 63 from the trained estimator 32C.
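The data flow of steps S201 and S203 in this modification can be summarized in a short sketch. The converters and the estimator are passed in as plain callables because their internals are trained networks that are not reproduced here; the `Calibration` class, the function names, and the stub components are hypothetical and only show how the three heat maps are wired together.

```python
from dataclasses import dataclass
from typing import Any, Callable, List

HeatMap = List[List[float]]
Image = List[List[float]]


@dataclass
class Calibration:
    reference_heat_map: HeatMap  # stands in for heat map 6021C
    truth_heat_map: HeatMap      # stands in for heat map 6031


def acquire_calibration(reference_image: Image, truth_heat_map: HeatMap,
                        converter_35c: Callable[[Image], HeatMap]) -> Calibration:
    """Step S201: convert the reference image and pair it with the truth map."""
    return Calibration(converter_35c(reference_image), truth_heat_map)


def estimate_gaze(target_image: Image, calibration: Calibration,
                  converter_31c: Callable[[Image], HeatMap],
                  estimator_32c: Callable[[HeatMap, HeatMap, HeatMap], Any]) -> Any:
    """Step S203: convert the target image, then feed all three heat maps."""
    target_heat_map = converter_31c(target_image)  # stands in for heat map 64C
    return estimator_32c(calibration.reference_heat_map,
                         calibration.truth_heat_map,
                         target_heat_map)


def identity_converter(image: Image) -> HeatMap:
    # Placeholder only: a real converter would be a trained network.
    return image


def stub_estimator(reference_hm: HeatMap, truth_hm: HeatMap,
                   target_hm: HeatMap) -> dict:
    # Placeholder only: returns its inputs so the wiring can be inspected.
    return {"reference": reference_hm, "truth": truth_hm, "target": target_hm}


calibration = acquire_calibration([[1.0]], [[0.5]], identity_converter)
result = estimate_gaze([[2.0]], calibration, identity_converter, stub_estimator)
```

Keeping the calibration in its own object makes it straightforward to compute it once and reuse it for every subsequent target image.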
- the line-of-sight direction of the target person R can be appropriately estimated from the feature information 602C, the true value information 603, and the target image 63 in the trained estimation model 3C as in the above embodiment.
- by using the feature information 602C and the true value information 603, it is possible to improve the accuracy of estimating the line-of-sight direction of the target person R in step S203.
- the fully connected layer tends to have a large number of parameters and a low calculation speed as compared with the convolution layer.
- each converter (31C, 35C) and an estimator 32C can be configured without using the fully connected layer.
- the amount of information of the estimation model 3C can be made relatively small, and the processing speed of the estimation model 3C can be improved. Furthermore, by adopting a common heat map format as the data format on the input side, the configuration of the estimator 32C can be made relatively simple, and the pieces of information (the feature information, the true value information, and the target image) can be easily integrated in the estimator 32C, so that the estimation accuracy of the estimator 32C can be expected to improve.
- the configuration of the estimation model 3C does not have to be limited to such an example.
- the true value information 603 may be directly input to the estimator 32C without being converted into the heat map 6031.
- the feature information 602C may be input to the estimator 32C in a format different from that of the heat map 6021C.
- the feature information 602C may be input to the estimator 32C in the form of a feature amount as in the above embodiment.
- the feature information 602C and the true value information 603 may be combined before being input to the estimator 32C.
- the estimator 32C may output the estimation result of the line-of-sight direction in the form of a heat map.
- the conversion layer 327 may be omitted.
- the control unit 21 may specify the line-of-sight direction of the target person R according to the center of gravity of the heat map, the position of the pixel having the maximum value, and the like. Estimating the true-value heat map from a training heat map is easier than estimating a numerical value from a training heat map, so a trained model with high estimation accuracy is easier to generate. Therefore, by adopting the heat map as the data format of both the input side and the output side, it is possible to improve the estimation accuracy of the line-of-sight direction by the estimation model 3C.
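Both ways of reading a gaze position out of a heat map can be sketched in a few lines. The sketch assumes a nested-list heat map with non-negative values; converting the resulting pixel position into a line-of-sight direction depends on the chosen coordinate mapping, which is not specified here, and the function names are hypothetical.

```python
from typing import List, Tuple

HeatMap = List[List[float]]


def argmax_position(heat_map: HeatMap) -> Tuple[int, int]:
    """Return (row, col) of the first pixel holding the maximum value."""
    best_value, best_row, best_col = max(
        ((value, r, c)
         for r, row in enumerate(heat_map)
         for c, value in enumerate(row)),
        key=lambda item: item[0],
    )
    return best_row, best_col


def center_of_gravity(heat_map: HeatMap) -> Tuple[float, float]:
    """Return the value-weighted mean (row, col) of the heat map."""
    total = sum(sum(row) for row in heat_map)
    if total <= 0:
        raise ValueError("heat map must contain positive mass")
    row_center = sum(r * value
                     for r, row in enumerate(heat_map) for value in row) / total
    col_center = sum(c * value
                     for row in heat_map for c, value in enumerate(row)) / total
    return row_center, col_center


estimated = [[0.0, 0.0, 0.0],
             [0.0, 2.0, 2.0],
             [0.0, 0.0, 0.0]]
peak = argmax_position(estimated)
centroid = center_of_gravity(estimated)
```

The maximum-value position is restricted to whole pixels, whereas the center of gravity can fall between pixels, as it does here between the two equally weighted columns.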
- the detection result of facial organ points may be expressed in a heat map format.
- the heat map showing the estimation result in the line-of-sight direction can be merged with the heat map showing the detection result of the organ points of the face, and each result can be output in a single display.
- each estimation model can be configured as a single unit, and real-time performance can be enhanced.
- at least one of the true value information 603 and the feature information 602C may be input to the estimator 32C in a format different from that of the heat map.
- the converter 35C may be omitted.
- the control unit 21 may directly acquire the feature information 602C.
- the process of converting the reference image 601 into the heat map 6021C may be executed by another computer.
- the control unit 21 may acquire the heat map 6021C from another computer.
- the feature information 602C may be configured by the reference image 601. Accordingly, the estimator 32C may be configured to accept the input of the reference image 601.
- a convolutional neural network is used for each extractor (31, 35, 41).
- a fully connected neural network is used for each of the estimators (32, 43) and the coupler 36.
- the types of neural networks available for each extractor (31, 35, 41), each estimator (32, 43), and coupler 36 need not be limited to such examples.
- a fully connected neural network, a recurrent neural network, or the like may be used for each extractor (31, 35, 41).
- a convolutional neural network or a recurrent neural network may be used for each estimator (32, 43) and the coupler 36.
- each component does not necessarily have to be separated.
- a combination of two or more components may be composed of one neural network.
- the estimation model 3 (the extractor 31 and the estimator 32) may be configured by one neural network.
- each extractor (31, 35, 41), each estimator (32, 43), and the coupler 36 need not be limited to the neural network.
- other models such as support vector machines, regression models, and decision tree models may be utilized for each extractor (31, 35, 41), each estimator (32, 43), and the coupler 36.
- the trained estimation model 3, the extractor 35, and the coupler 36 may be generated by a computer other than the model generation device 1.
- the process of step S102 may be omitted from the processing procedure of the model generation device 1.
- the processing of steps S103 to S107 may be omitted from the processing procedure of the model generation device 1.
- the first acquisition unit 112 and the second acquisition unit 113 may be omitted from the software configuration of the model generation device 1.
- the model generation device 1 may be omitted from the line-of-sight estimation system 100.
- the calibration information 60 may be given in advance, for example, by executing the process of step S201 within the initial setting process. In this case, the process of step S201 may be omitted from the process procedure of the line-of-sight estimation device 2. Further, if the calibration information 60 is not changed after the calibration information 60 is acquired, the learned extractor 35 and the coupler 36 may be omitted or deleted in the line-of-sight estimation device 2. At least part of the process of acquiring the calibration information 60 may be performed by another computer. In this case, the line-of-sight estimation device 2 may acquire the calibration information 60 by acquiring the calculation result of another computer.
- the line-of-sight estimation device 2 does not have to repeat the process of estimating the line-of-sight direction.
- the process of step S205 may be omitted from the process procedure of the line-of-sight estimation device 2.
- the data set 120 may not be used for acquiring each learning data set 51 and the learning calibration information 50.
- the processing of step S101 may be omitted from the processing procedure of the model generation device 1.
- the collection unit 111 may be omitted from the software configuration of the model generation device 1.
- 1 ... Model generation device, 11 ... Control unit, 12 ... Storage unit, 13 ... Communication interface, 14 ... External interface, 15 ... Input device, 16 ... Output device, 17 ... Drive, 111 ... Collection unit, 112 ... First acquisition unit, 113 ... Second acquisition unit, 114 ... Machine learning unit, 115 ... Storage processing unit, 120 ... Data set, 121 ... Learning image, 123 ... Correct answer information, 125 ... Learning result data, 81 ... Model generation program, 91 ... Storage medium, 2 ... Line-of-sight estimation device, 21 ... Control unit, 22 ... Storage unit, 23 ... Communication interface, 24 ... External interface, 25 ... Input device, 26 ... Output device, 27 ... Drive, 211 ... Information acquisition unit, 212 ... Image acquisition unit, 213 ... Estimation unit, 214 ... Output unit, 261 ... Display, M ... Mark, 82 ... Line-of-sight estimation program, 92 ... Storage medium, 30 ... Learning model, 3 ... Estimation model, 31 ... Extractor (first extractor), 311 ... Convolution layer, 312 ... Pooling layer, 32 ... Estimator, 321 ... Fully connected layer, 35 ... Extractor (second extractor), 351 ... Convolution layer, 352 ... Pooling layer, 36 ... Coupler, 361 ... Fully connected layer, 4 ... Learning model, 41 ... Extractor, 411 ... Convolution layer, 412 ... Pooling layer, 43 ... Estimator, 431 ...
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Multimedia (AREA)
- Software Systems (AREA)
- Artificial Intelligence (AREA)
- Computing Systems (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biophysics (AREA)
- Molecular Biology (AREA)
- Computational Linguistics (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Data Mining & Analysis (AREA)
- Biomedical Technology (AREA)
- Quality & Reliability (AREA)
- Databases & Information Systems (AREA)
- Ophthalmology & Optometry (AREA)
- Medical Informatics (AREA)
- Human Computer Interaction (AREA)
- Image Analysis (AREA)
- Eye Examination Apparatus (AREA)
Abstract
Description
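FIG. 1 schematically illustrates an example of a scene to which the present invention is applied. As shown in FIG. 1, the line-of-sight estimation system 100 according to the present embodiment includes a model generation device 1 and a line-of-sight estimation device 2.
[Hardware configuration]
<Model generation device>
FIG. 2 schematically illustrates an example of the hardware configuration of the model generation device 1 according to the present embodiment. As shown in FIG. 2, the model generation device 1 according to the present embodiment is a computer in which a control unit 11, a storage unit 12, a communication interface 13, an external interface 14, an input device 15, an output device 16, and a drive 17 are electrically connected. In FIG. 2, the communication interface and the external interface are labeled "communication I/F" and "external I/F".
FIG. 3 schematically illustrates an example of the hardware configuration of the line-of-sight estimation device 2 according to the present embodiment. As shown in FIG. 3, the line-of-sight estimation device 2 according to the present embodiment is a computer in which a control unit 21, a storage unit 22, a communication interface 23, an external interface 24, an input device 25, an output device 26, and a drive 27 are electrically connected.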
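<Model generation device>
FIGS. 4A and 4B schematically illustrate an example of the software configuration of the model generation device 1 according to the present embodiment. The control unit 11 of the model generation device 1 loads the model generation program 81 stored in the storage unit 12 into RAM. The control unit 11 then has the CPU interpret and execute the instructions contained in the model generation program 81 loaded into RAM, and controls each component. As a result, as shown in FIGS. 4A and 4B, the model generation device 1 according to the present embodiment operates as a computer that includes a collection unit 111, a first acquisition unit 112, a second acquisition unit 113, a machine learning unit 114, and a storage processing unit 115 as software modules. That is, in the present embodiment, each software module of the model generation device 1 is realized by the control unit 11 (CPU).
Each extractor (31, 35, 41), each estimator (32, 43), and the coupler 36 is composed of a machine-learnable model having calculation parameters. The type of machine learning model used for each is not particularly limited as long as it can execute the respective arithmetic processing, and may be selected as appropriate for the embodiment. In the present embodiment, a convolutional neural network is used for each extractor (31, 35, 41), and a fully connected neural network is used for each estimator (32, 43) and the coupler 36.
FIGS. 5A and 5B schematically illustrate an example of the software configuration of the line-of-sight estimation device 2 according to the present embodiment. The control unit 21 of the line-of-sight estimation device 2 loads the line-of-sight estimation program 82 stored in the storage unit 22 into RAM. The control unit 21 then has the CPU interpret and execute the instructions contained in the line-of-sight estimation program 82 loaded into RAM, and controls each component. As a result, as shown in FIGS. 5A and 5B, the line-of-sight estimation device 2 according to the present embodiment operates as a computer that includes an information acquisition unit 211, an image acquisition unit 212, an estimation unit 213, and an output unit 214 as software modules. That is, in the present embodiment, each software module of the line-of-sight estimation device 2 is realized by the control unit 21 (CPU), as in the model generation device 1.
The software modules of the model generation device 1 and the line-of-sight estimation device 2 are described in detail in the operation examples below. The present embodiment describes an example in which every software module of the model generation device 1 and the line-of-sight estimation device 2 is realized by a general-purpose CPU. However, some or all of these software modules may be realized by one or more dedicated processors. That is, each of the above modules may be realized as a hardware module. In addition, software modules may be omitted, replaced, or added in the software configuration of each of the model generation device 1 and the line-of-sight estimation device 2 as appropriate for the embodiment.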
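[Model generation device]
FIG. 6 is a flowchart showing an example of the processing procedure of the model generation device 1 according to the present embodiment. The processing procedure described below is an example of a model generation method. However, it is only an example, and each step may be changed to the extent possible. Furthermore, steps may be omitted, replaced, or added in the processing procedure described below as appropriate for the embodiment.
In step S101, the control unit 11 operates as the collection unit 111 and collects a plurality of data sets 120 for learning from subjects. Each data set 120 is composed of a combination of a learning image 121 in which the eyes of a subject appear and correct answer information 123 indicating the true value of the line-of-sight direction of the subject appearing in that learning image 121.
In step S102, the control unit 11 operates as the machine learning unit 114 and performs machine learning of the learning model 4 using the collected data sets 120. In this machine learning, the control unit 11 trains the extractor 41 and the estimator 43 so that, for each data set 120, the output value (the estimation result of the line-of-sight direction) obtained from the estimator 43 by inputting the learning image 121 into the extractor 41 matches the corresponding correct answer information 123. Not all of the collected data sets 120 necessarily have to be used for the machine learning of the learning model 4. The data sets 120 used for the machine learning of the learning model 4 may be selected as appropriate.
In step S103, the control unit 11 uses the learning result of the extractor 41 to prepare a learning model 30 that includes the estimation model 3.
In step S104, the control unit 11 operates as the first acquisition unit 112 and acquires learning calibration information 50 including learning feature information 502 and learning true value information 503.
In step S105, the control unit 11 operates as the second acquisition unit 113 and acquires a plurality of learning data sets 51, each composed of a combination of a learning target image 53 and correct answer information 55.
In step S106, the control unit 11 operates as the machine learning unit 114 and performs machine learning of the estimation model 3 using the acquired learning data sets 51. In this machine learning, the control unit 11 trains the estimation model 3 so that, for each learning data set 51, it outputs an output value that matches the corresponding correct answer information 55 in response to input of the learning target image 53 and the learning calibration information 50.
In step S107, the control unit 11 operates as the storage processing unit 115 and generates, as learning result data 125, information on the trained learning model 30 (the estimation model 3, the extractor 35, and the coupler 36) generated by machine learning. The control unit 11 then stores the generated learning result data 125 in a predetermined storage area.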
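FIG. 7 is a flowchart showing an example of the processing procedure of the line-of-sight estimation device 2 according to the present embodiment. The processing procedure described below is an example of a line-of-sight estimation method. However, it is only an example, and each step may be changed to the extent possible. Furthermore, steps may be omitted, replaced, or added in the processing procedure described below as appropriate for the embodiment.
In step S201, the control unit 21 operates as the information acquisition unit 211 and acquires calibration information 60 including feature information 602 and true value information 603.
In step S202, the control unit 21 operates as the image acquisition unit 212 and acquires a target image 63 in which the eyes of the target person R appear. In the present embodiment, the control unit 21 controls the operation of the camera S via the external interface 24 so as to photograph the target person R. The control unit 21 can thereby acquire, directly from the camera S, the target image 63 to be subjected to the line-of-sight direction estimation process. The target image 63 may be a moving image or a still image. However, the route for acquiring the target image 63 is not limited to this example. For example, the camera S may be controlled by another computer. In that case, the control unit 21 may acquire the target image 63 from the camera S indirectly via the other computer. Upon acquiring the target image 63, the control unit 21 proceeds to the next step S203.
In step S203, the control unit 21 operates as the estimation unit 213 and uses the trained estimation model 3 to estimate the line-of-sight direction of the eyes of the target person R appearing in the target image 63. In this estimation process, the control unit 21 inputs the acquired target image 63 and calibration information 60 into the trained estimation model 3 and executes the arithmetic processing of the trained estimation model 3. The control unit 21 thereby acquires, from the trained estimation model 3, an output value corresponding to the result of estimating the line-of-sight direction of the eyes of the target person R appearing in the target image 63.
In step S204, the control unit 21 operates as the output unit 214 and outputs information on the result of estimating the line-of-sight direction of the target person R.
In step S205, it is determined whether or not to repeat the line-of-sight direction estimation process. The criterion for this determination may be set as appropriate for the embodiment.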
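The control flow of steps S201 to S205 can be sketched as a simple loop in which calibration is acquired once and then reused for every target image. All five callables are placeholders supplied by the caller; their names, and the use of integers as stand-in data in the example, are assumptions made purely for illustration.

```python
from typing import Any, Callable, Iterable, List


def run_gaze_estimation(acquire_calibration: Callable[[], Any],
                        frames: Iterable[Any],
                        estimate: Callable[[Any, Any], Any],
                        output: Callable[[Any], None],
                        should_continue: Callable[[int], bool]) -> List[Any]:
    """Run steps S201 to S205 until the frames run out or repetition stops."""
    calibration = acquire_calibration()           # step S201 (once)
    results: List[Any] = []
    for frame in frames:                          # step S202
        direction = estimate(frame, calibration)  # step S203
        output(direction)                         # step S204
        results.append(direction)
        if not should_continue(len(results)):     # step S205
            break
    return results


# Stand-in data: integers instead of images and calibration information.
shown: List[int] = []
results = run_gaze_estimation(
    acquire_calibration=lambda: 10,
    frames=[1, 2, 3],
    estimate=lambda frame, calibration: frame + calibration,
    output=shown.append,
    should_continue=lambda count: count < 2,
)
```

In this toy run the loop stops after two estimates because the repetition criterion of step S205 returns false at that point.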
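As described above, in step S203 of the present embodiment, not only the target image 63 in which the eyes of the target person R appear but also the calibration information 60 including the feature information 602 and the true value information 603 is used to estimate the line-of-sight direction of the target person R. The feature information 602 and the true value information 603 make it possible to grasp the individual characteristics of the line of sight of the target person R for a direction that is known from the true value. Therefore, according to the present embodiment, the line-of-sight direction of the target person R appearing in the target image 63 can be estimated while taking into account the individual differences between the subject and the target person R that can be grasped from the calibration information 60. The accuracy of estimating the line-of-sight direction of the target person R in step S203 can therefore be improved. Using the calibration information 60 can also be expected to improve the estimation accuracy for a target person R whose eyes do not point in the line-of-sight direction, for example because of strabismus. In the present embodiment, the calibration information 60 may also include feature information 602 and true value information 603 corresponding to each of a plurality of different predetermined directions. This allows the individual characteristics of the line of sight of the target person R to be grasped more accurately from the calibration information 60 for the plurality of different predetermined directions, so the accuracy of estimating the line-of-sight direction of the target person R can be improved further. According to the model generation device 1 of the present embodiment, the processing of steps S101 to S107 can generate a trained estimation model 3 capable of estimating the line-of-sight direction of the target person R with such high accuracy.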
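Although embodiments of the present invention have been described in detail above, the foregoing description is in every respect merely an illustration of the present invention. It goes without saying that various improvements and modifications can be made without departing from the scope of the present invention. For example, the following changes are possible. In the following, the same reference signs are used for components that are the same as in the above embodiment, and descriptions of points that are the same as in the above embodiment are omitted as appropriate. The following modifications can be combined as appropriate.
In the above embodiment, the camera S is used to acquire the calibration information 60. However, the sensor for observing the line of sight of the target person R is not limited to this example. The type of sensor is not particularly limited as long as it can observe the characteristics of the line of sight of the target person R, and may be selected as appropriate for the embodiment. For example, a scleral contact lens containing a coil or an electro-oculography sensor may be used as the sensor. In this case, as in the above embodiment, the line-of-sight estimation device 2 may output an instruction to the target person R to look in a predetermined direction and then observe the line of sight of the target person R with the sensor. The feature information 602 can be acquired from the sensing data obtained by this observation. For example, the search coil method or the EOG (electro-oculogram) method may be used to acquire the feature information 602.
In the above embodiment, the estimation model 3 is composed of the extractor 31 and the estimator 32. The calibration information 60 is composed of the feature amount 604 derived from the reference image 601 and the true value information 603 by using the extractor 35 and the coupler 36. The estimator 32 is configured to accept input of the feature amount 604 derived by the coupler 36 and the feature amount 64 relating to the target image 63. However, the configurations of the estimation model 3 and the calibration information 60 are not limited to this example.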
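As shown in FIG. 13A, in step S102 above, the control unit 11 performs machine learning of the converter 41C using the plurality of data sets 120. As an example, the control unit 11 first inputs the learning image 121 of each data set 120 into the converter 41C and executes the arithmetic processing of the converter 41C. The control unit 11 thereby acquires, from the converter 41C, an output value corresponding to the heat map converted from the learning image 121.
As shown in FIG. 14, in this modification, by holding the learning result data 125C, the information acquisition unit 211 has the trained converter 35C, and the estimation unit 213 has the trained estimation model 3C. The trained estimation model 3C includes the trained converter 31C and estimator 32C.
In the above embodiment, a convolutional neural network is used for each extractor (31, 35, 41), and a fully connected neural network is used for each estimator (32, 43) and the coupler 36. However, the types of neural network that can be used for each extractor (31, 35, 41), each estimator (32, 43), and the coupler 36 are not limited to this example. A fully connected neural network, a recurrent neural network, or the like may be used for each extractor (31, 35, 41). A convolutional neural network or a recurrent neural network may be used for each estimator (32, 43) and the coupler 36.
In the above embodiment, the calibration information 60 may be given in advance, for example by executing the process of step S201 within the initial setting process. In this case, the process of step S201 may be omitted from the processing procedure of the line-of-sight estimation device 2. If the calibration information 60 is not changed after it has been acquired, the trained extractor 35 and coupler 36 may be omitted or deleted from the line-of-sight estimation device 2. At least part of the process of acquiring the calibration information 60 may be executed by another computer. In this case, the line-of-sight estimation device 2 may acquire the calibration information 60 by acquiring the calculation result of the other computer.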
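11…control unit, 12…storage unit, 13…communication interface,
14…external interface,
15…input device, 16…output device, 17…drive,
111…collection unit, 112…first acquisition unit,
113…second acquisition unit, 114…machine learning unit,
115…storage processing unit,
120…data set,
121…learning image, 123…correct answer information,
125…learning result data,
81…model generation program, 91…storage medium,
2…gaze estimation apparatus,
21…control unit, 22…storage unit, 23…communication interface,
24…external interface,
25…input device, 26…output device, 27…drive,
211…information acquisition unit, 212…image acquisition unit,
213…estimation unit, 214…output unit,
261…display, M…mark,
82…gaze estimation program, 92…storage medium,
30…learning model, 3…estimation model,
31…extractor (first extractor),
311…convolution layer, 312…pooling layer,
32…estimator, 321…fully connected layer,
35…extractor (second extractor),
351…convolution layer, 352…pooling layer,
36…combiner, 361…fully connected layer,
4…learning model,
41…extractor,
411…convolution layer, 412…pooling layer,
43…estimator, 431…fully connected layer,
50…learning calibration information,
501…learning reference image,
502…learning feature information, 5021…feature amount,
503…learning true value information,
504…feature amount,
51…learning data set,
53…learning target image, 54…feature amount,
55…correct answer information,
60…calibration information,
601…reference image,
602…feature information, 6021…feature amount (second feature amount),
603…true value information,
604…feature amount (calibration feature amount),
63…target image, 64…feature amount (first feature amount),
R…target person, S…camera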
Claims (15)
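- A gaze estimation apparatus comprising:
an information acquisition unit that acquires calibration information including feature information on the gaze of the eyes of a target person looking in a predetermined direction, and true value information indicating the true value of the predetermined direction in which the eyes of the target person are looking;
an image acquisition unit that acquires a target image showing the eyes of the target person;
an estimation unit that estimates the gaze direction of the target person shown in the target image using a trained estimation model generated by machine learning, wherein
through the machine learning, the trained estimation model has been trained to output, in response to input of learning calibration information and a learning target image obtained from a test subject, an output value that fits correct answer information indicating the true value of the gaze direction of the test subject shown in the learning target image, and
estimating the gaze direction consists of inputting the acquired target image and the calibration information into the trained estimation model and executing the arithmetic processing of the trained estimation model, thereby acquiring from the trained estimation model an output value corresponding to the result of estimating the gaze direction of the target person shown in the target image; and
an output unit that outputs information on the result of estimating the gaze direction of the target person.
- The gaze estimation apparatus according to claim 1, wherein
the calibration information includes the feature information and the true value information corresponding to each of a plurality of different predetermined directions.
- The gaze estimation apparatus according to claim 1 or 2, wherein
including the feature information and the true value information consists of including a calibration feature amount relating to calibration that is derived by combining the feature information and the true value information,
the trained estimation model includes a first extractor and an estimator, and
executing the arithmetic processing of the trained estimation model consists of:
inputting the acquired target image into the first extractor and executing the arithmetic processing of the first extractor, thereby acquiring from the first extractor an output value corresponding to a first feature amount of the target image; and
inputting the calibration feature amount and the acquired first feature amount into the estimator and executing the arithmetic processing of the estimator.
- The gaze estimation apparatus according to claim 3, wherein
the feature information consists of a second feature amount of a reference image showing the eyes of the target person looking in the predetermined direction,
the information acquisition unit has a combiner, and
acquiring the calibration information consists of:
acquiring the second feature amount;
acquiring the true value information; and
inputting the acquired second feature amount and the true value information into the combiner and executing the arithmetic processing of the combiner, thereby acquiring from the combiner an output value corresponding to the calibration feature amount.
- The gaze estimation apparatus according to claim 4, wherein
the information acquisition unit further has a second extractor, and
acquiring the second feature amount consists of:
acquiring the reference image; and
inputting the acquired reference image into the second extractor and executing the arithmetic processing of the second extractor, thereby acquiring from the second extractor an output value corresponding to the second feature amount.
- The gaze estimation apparatus according to claim 1 or 2, wherein
the trained estimation model includes a first extractor and an estimator, and
executing the arithmetic processing of the trained estimation model consists of:
inputting the acquired target image into the first extractor and executing the arithmetic processing of the first extractor, thereby acquiring from the first extractor an output value corresponding to a first feature amount of the target image; and
inputting the feature information, the true value information, and the acquired first feature amount into the estimator and executing the arithmetic processing of the estimator.
- The gaze estimation apparatus according to claim 6, wherein
the feature information consists of a second feature amount of a reference image showing the eyes of the target person looking in the predetermined direction,
the information acquisition unit has a second extractor, and
acquiring the calibration information consists of:
acquiring the reference image;
inputting the acquired reference image into the second extractor and executing the arithmetic processing of the second extractor, thereby acquiring from the second extractor an output value corresponding to the second feature amount; and
acquiring the true value information.
- The gaze estimation apparatus according to claim 1 or 2, wherein
the feature information consists of a reference image showing the eyes of the target person looking in the predetermined direction,
the trained estimation model includes a first extractor, a second extractor, and an estimator, and
executing the arithmetic processing of the trained estimation model consists of:
inputting the acquired target image into the first extractor and executing the arithmetic processing of the first extractor, thereby acquiring from the first extractor an output value corresponding to a first feature amount of the target image;
inputting the reference image into the second extractor and executing the arithmetic processing of the second extractor, thereby acquiring from the second extractor an output value corresponding to a second feature amount of the reference image; and
inputting the acquired first feature amount, the acquired second feature amount, and the true value information into the estimator and executing the arithmetic processing of the estimator.
- The gaze estimation apparatus according to claim 1 or 2, wherein
the trained estimation model includes a first converter and an estimator, and
executing the arithmetic processing of the trained estimation model consists of:
inputting the acquired target image into the first converter and executing the arithmetic processing of the first converter, thereby acquiring from the first converter an output value corresponding to a first heat map relating to the gaze direction of the target person; and
inputting the acquired first heat map, the feature information, and the true value information into the estimator and executing the arithmetic processing of the estimator.
- The gaze estimation apparatus according to claim 9, wherein
the feature information consists of a second heat map that relates to the gaze direction of the eyes looking in the predetermined direction and is derived from a reference image showing the eyes of the target person looking in the predetermined direction,
the information acquisition unit has a second converter,
acquiring the calibration information consists of:
acquiring the reference image;
inputting the acquired reference image into the second converter and executing the arithmetic processing of the second converter, thereby acquiring from the second converter an output value corresponding to the second heat map;
acquiring the true value information; and
converting the true value information into a third heat map relating to the true value of the predetermined direction, and
inputting the first heat map, the feature information, and the true value information into the estimator consists of inputting the first heat map, the second heat map, and the third heat map into the estimator.
- The gaze estimation apparatus according to any one of claims 1 to 10, wherein
the acquisition of the target image by the image acquisition unit and the estimation of the gaze direction of the target person by the estimation unit are executed repeatedly.
- The gaze estimation apparatus according to any one of claims 1 to 11, wherein
the information acquisition unit acquires the calibration information by outputting an instruction to the target person to look in the predetermined direction and then observing the gaze of the target person with a sensor.
- A gaze estimation method in which a computer executes:
a step of acquiring calibration information including feature information on the gaze of the eyes of a target person looking in a predetermined direction, and true value information indicating the true value of the predetermined direction in which the eyes of the target person are looking;
a step of acquiring a target image showing the eyes of the target person;
a step of estimating the gaze direction of the target person shown in the target image using a trained estimation model generated by machine learning, wherein
through the machine learning, the trained estimation model has been trained to output, in response to input of learning calibration information and a learning target image obtained from a test subject, an output value that fits correct answer information indicating the true value of the gaze direction of the test subject shown in the learning target image, and
estimating the gaze direction consists of inputting the acquired target image and the calibration information into the trained estimation model and executing the arithmetic processing of the trained estimation model, thereby acquiring from the trained estimation model an output value corresponding to the result of estimating the gaze direction of the target person shown in the target image; and
a step of outputting information on the result of estimating the gaze direction of the target person.
- A model generation apparatus comprising:
a first acquisition unit that acquires learning calibration information including learning feature information on the gaze of the eyes of a test subject looking in a predetermined direction, and learning true value information indicating the true value of the predetermined direction in which the eyes of the test subject are looking;
a second acquisition unit that acquires a plurality of learning data sets, each consisting of a combination of a learning target image showing the eyes of the test subject and correct answer information indicating the true value of the gaze direction of the test subject shown in the learning target image; and
a machine learning unit that performs machine learning of an estimation model using the acquired plurality of learning data sets, wherein performing the machine learning consists of training the estimation model so that, for each learning data set, it outputs an output value that fits the corresponding correct answer information in response to input of the learning target image and the learning calibration information.
- A model generation method in which a computer executes:
a step of acquiring learning calibration information including learning feature information on the gaze of the eyes of a test subject looking in a predetermined direction, and learning true value information indicating the true value of the predetermined direction in which the eyes of the test subject are looking;
a step of acquiring a plurality of learning data sets, each consisting of a combination of a learning target image showing the eyes of the test subject and correct answer information indicating the true value of the gaze direction of the test subject shown in the learning target image; and
a step of performing machine learning of an estimation model using the acquired plurality of learning data sets, wherein performing the machine learning consists of training the estimation model so that, for each learning data set, it outputs an output value that fits the corresponding correct answer information in response to input of the learning target image and the learning calibration information.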
Priority Applications (5)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| JP2021569686A JP7310931B2 (ja) | 2020-01-10 | 2020-01-10 | 視線推定装置、視線推定方法、モデル生成装置、及びモデル生成方法 |
| PCT/JP2020/000643 WO2021140642A1 (ja) | 2020-01-10 | 2020-01-10 | 視線推定装置、視線推定方法、モデル生成装置、及びモデル生成方法 |
| CN202080085841.0A CN114787861B (zh) | 2020-01-10 | 2020-01-10 | 视线推测装置、视线推测方法、模型生成装置以及模型生成方法 |
| EP20911379.4A EP4089628B1 (en) | 2020-01-10 | 2020-01-10 | Sight line estimation device, sight line estimation method, model generation device, and model generation method |
| US17/789,234 US12243351B2 (en) | 2020-01-10 | 2020-01-10 | Gaze estimation apparatus, gaze estimation method, model generation apparatus, and model generation method |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/JP2020/000643 WO2021140642A1 (ja) | 2020-01-10 | 2020-01-10 | 視線推定装置、視線推定方法、モデル生成装置、及びモデル生成方法 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2021140642A1 true WO2021140642A1 (ja) | 2021-07-15 |
Family
ID=76787774
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/JP2020/000643 Ceased WO2021140642A1 (ja) | 2020-01-10 | 2020-01-10 | 視線推定装置、視線推定方法、モデル生成装置、及びモデル生成方法 |
Country Status (5)
| Country | Link |
|---|---|
| US (1) | US12243351B2 (ja) |
| EP (1) | EP4089628B1 (ja) |
| JP (1) | JP7310931B2 (ja) |
| CN (1) | CN114787861B (ja) |
| WO (1) | WO2021140642A1 (ja) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2024232398A1 (ja) * | 2023-05-10 | 2024-11-14 | 国立大学法人 東京大学 | 学習モデル生成プログラム、情報処理装置及び学習モデル生成方法 |
Families Citing this family (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN113383295B (zh) * | 2019-02-01 | 2025-04-25 | 苹果公司 | 调节数字内容以激发更大的瞳孔半径响应的生物反馈方法 |
| CN119338869B (zh) * | 2023-07-20 | 2026-01-13 | 北京字跳网络技术有限公司 | 校准图像的筛选方法、视线估计方法、装置及设备 |
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2019028843A (ja) | 2017-08-01 | 2019-02-21 | オムロン株式会社 | 人物の視線方向を推定するための情報処理装置及び推定方法、並びに学習装置及び学習方法 |
| US20190303724A1 (en) * | 2018-03-30 | 2019-10-03 | Tobii Ab | Neural Network Training For Three Dimensional (3D) Gaze Prediction With Calibration Parameters |
Family Cites Families (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN108171152A (zh) * | 2017-12-26 | 2018-06-15 | 深圳大学 | 深度学习人眼视线估计方法、设备、系统及可读存储介质 |
| EP3750029A1 (en) | 2018-02-09 | 2020-12-16 | Pupil Labs GmbH | Devices, systems and methods for predicting gaze-related parameters using a neural network |
| CN109835260B (zh) * | 2019-03-07 | 2023-02-03 | 百度在线网络技术(北京)有限公司 | 一种车辆信息显示方法、装置、终端和存储介质 |
| US20220043509A1 (en) * | 2019-06-28 | 2022-02-10 | Tobii Ab | Gaze tracking |
| US11308698B2 (en) * | 2019-12-05 | 2022-04-19 | Facebook Technologies, Llc. | Using deep learning to determine gaze |
2020
- 2020-01-10 JP JP2021569686A patent/JP7310931B2/ja active Active
- 2020-01-10 US US17/789,234 patent/US12243351B2/en active Active
- 2020-01-10 WO PCT/JP2020/000643 patent/WO2021140642A1/ja not_active Ceased
- 2020-01-10 CN CN202080085841.0A patent/CN114787861B/zh active Active
- 2020-01-10 EP EP20911379.4A patent/EP4089628B1/en active Active
Non-Patent Citations (1)
| Title |
|---|
| See also references of EP4089628A4 |
Also Published As
| Publication number | Publication date |
|---|---|
| JPWO2021140642A1 (ja) | 2021-07-15 |
| EP4089628B1 (en) | 2026-03-04 |
| JP7310931B2 (ja) | 2023-07-19 |
| US20230036611A1 (en) | 2023-02-02 |
| CN114787861B (zh) | 2026-01-16 |
| EP4089628A1 (en) | 2022-11-16 |
| CN114787861A (zh) | 2022-07-22 |
| EP4089628A4 (en) | 2023-05-10 |
| US12243351B2 (en) | 2025-03-04 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| KR102345026B1 (ko) | | 클라우드 서버 및 클라우드 서버 기반의 진단 보조 시스템 |
| CN111046734B (zh) | | 基于膨胀卷积的多模态融合视线估计方法 |
| CN110570421B (zh) | | 多任务的眼底图像分类方法和设备 |
| US10884494B1 (en) | | Eye tracking device calibration |
| JP2021184289A (ja) | | 眼画像の収集、選択および組み合わせ |
| CN109583338A (zh) | | 基于深度融合神经网络的驾驶员视觉分散检测方法 |
| CN112673378A (zh) | | 推断器生成装置、监视装置、推断器生成方法以及推断器生成程序 |
| CN110688874B (zh) | | 人脸表情识别方法及其装置、可读存储介质和电子设备 |
| KR20190070432A (ko) | | 영상 데이터를 분석하는 인공 지능을 이용한 질병 진단 방법 및 진단 시스템 |
| CN114120432A (zh) | | 基于视线估计的在线学习注意力跟踪方法及其应用 |
| JP7645041B2 (ja) | | 軸外カメラを使用して眼追跡を実施するための方法およびシステム |
| CN110503068A (zh) | | 视线估计方法、终端及存储介质 |
| KR102122302B1 (ko) | | 안저 촬영기를 제어하는 방법 및 이를 이용한 장치 |
| WO2021140642A1 (ja) | | 視線推定装置、視線推定方法、モデル生成装置、及びモデル生成方法 |
| CN112183160A (zh) | | 视线估计方法及装置 |
| JP7638962B2 (ja) | | 軸外カメラを使用した眼追跡および視線推定 |
| KR102657095B1 (ko) | | 탈모 상태 정보 제공 방법 및 장치 |
| EP4325517A1 (en) | | Methods and devices in performing a vision testing procedure on a person |
| CN118092661A (zh) | | 一种融合多观测角度眼部图像特征的注视估计装置及方法 |
| CN116012932B (zh) | | 一种驾驶员自适应的注视方向估计方法 |
| KR20230024232A (ko) | | 영상 처리를 이용한 피부 질환 진단 방법 및 장치 |
| KR20190114602A (ko) | | 다중 구조 인공신경망을 이용한 가상 피팅을 수행하기 위한 장치, 이를 위한 방법 및 이 방법을 수행하는 프로그램이 기록된 컴퓨터 판독 가능한 기록매체 |
| US20220211267A1 (en) | | Device navigation and capture of media data |
| Chandran et al. | | Real time diagnosis of ocular diseases using AI |
| CN120147768A (zh) | | 模型训练、视线估计方法,电子设备及存储介质 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 20911379; Country of ref document: EP; Kind code of ref document: A1 |
| | ENP | Entry into the national phase | Ref document number: 2021569686; Country of ref document: JP; Kind code of ref document: A |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | ENP | Entry into the national phase | Ref document number: 2020911379; Country of ref document: EP; Effective date: 20220810 |
| | WWG | Wipo information: grant in national office | Ref document number: 17789234; Country of ref document: US |
| | WWG | Wipo information: grant in national office | Ref document number: 2020911379; Country of ref document: EP |