WO2021140957A1 - Information processing device, information processing method, and program
- Publication number
- WO2021140957A1 (PCT/JP2020/048727)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- data set
- data
- unit
- information processing
- prediction
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/04—Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
Definitions
- the present disclosure relates to an information processing device, an information processing method, and a program, and more particularly to an information processing device, an information processing method, and a program designed to improve user satisfaction with predictive analysis.
- A leak means that data which is not available at the time of prediction is used for training a prediction model; the term also refers to such data itself. For example, when predicting the number of bento boxes sold, the prediction accuracy of the model is expected to improve if the data "whether or not the bento boxes sold out on that day" is used for learning. However, the purpose of predicting the number of sales is to predict, several days in advance, how many bento boxes will be sold in order to decide how many to produce, so data that is not known until the day itself cannot actually be used.
- If a leak occurs, the prediction accuracy of the prediction model differs between the time of pre-evaluation and the time of actual operation, and the user's satisfaction level may decrease.
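- The bento example can be made concrete with a minimal sketch. It assumes, purely for illustration, that every column carries an annotation saying how many days before the target date its value becomes known; the `Column` class, the `known_days_before` field, and the column names are hypothetical and are not taken from the disclosure, which detects leaks from the nature of the data set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    # Hypothetical annotation: number of days before the target date at which
    # the value is already known. 0 means "known only on the day itself".
    known_days_before: int


def split_leaky_columns(columns, forecast_lead_days):
    """Separate columns usable at prediction time from columns that would leak."""
    usable = [c.name for c in columns if c.known_days_before >= forecast_lead_days]
    leaky = [c.name for c in columns if c.known_days_before < forecast_lead_days]
    return usable, leaky


columns = [
    Column("day_of_week", known_days_before=365),
    Column("weather_forecast", known_days_before=3),
    Column("sold_out_on_the_day", known_days_before=0),
]

# The forecast is made 3 days ahead, so anything known later than that leaks.
usable, leaky = split_leaky_columns(columns, forecast_lead_days=3)
print(usable)
print(leaky)
```

- With a three-day lead time, "sold_out_on_the_day" is the only column that ends up in the leaky group, which matches the reasoning in the bento example.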
- Patent Document 1 does not specifically consider leaks.
- This disclosure was made in view of such a situation, and is intended to improve user satisfaction with predictive analysis.
- the present disclosure includes a leak detection unit that detects the possibility of leakage of the data set based on the nature of the data set used for learning the prediction model used for the prediction analysis.
- the possibility of leakage of the data set is detected based on the nature of the data set used for learning the prediction model used for the prediction analysis.
- 1. Embodiment
- 1-1. Background
- 1-2. Configuration example of the information processing system according to the embodiment
- 1-3. Outline of information processing according to the embodiment
- 1-4. Configuration example of the information processing device according to the embodiment
- 1-5. Configuration example of the time series processing unit according to the embodiment
- 1-6. Configuration example of the leak detection and correction unit according to the embodiment
- 1-7. Information processing procedure according to the embodiment
- 2. Modification example
- 3. Hardware configuration
- the user decides what kind of predictive analysis should be performed based on the accumulated data. Furthermore, the user evaluates the business effect obtained by introducing the predictive analysis by conducting a proof experiment of the determined predictive analysis. In this way, by conducting a demonstration experiment and evaluating the business effect obtained by the predictive analysis, the user can introduce a highly effective predictive analysis into the business, and the predictive analysis can be utilized in the business.
- FIG. 1 is a diagram illustrating a business introduction of predictive analysis.
- In step S1, the user sets a problem, that is, which of the accumulated data is to be used and what is to be predicted. Examples of problem settings include "use data such as the customer's annual income and total assets to predict whether or not a loan will become a bad debt" and "use data such as past sales and customer age group to predict future sales". In this way, the appropriate problem setting differs depending on the business field and the user. Therefore, the user sets a problem based on, for example, his or her own knowledge and experience.
- In step S2, the user constructs a data set according to the problem setting from the accumulated data.
- the user constructs a data set by, for example, extracting data to be used for predictive analysis from accumulated data, and interpreting and structuring the data according to the predictive analysis. For example, the knowledge and experience of the user may be required to construct the data set.
- In step S3, the user generates a prediction model based on the problem setting and the data set. Prediction models are generated using common machine learning techniques. In this case, the user can generate a prediction model using, for example, an existing information processing device.
- In step S4, the user evaluates the accuracy of the generated prediction model.
- The accuracy of the prediction model is evaluated using a general evaluation index such as the area under the curve (AUC) or accuracy.
- The user can evaluate the accuracy of the prediction model using, for example, an existing information processing device.
- In step S5, the user conducts a demonstration experiment using the generated prediction model.
- the user collects data in a limited range such as a period or region, and performs predictive analysis of the data using the generated prediction model.
- the user introduces predictive analysis into the business on a trial basis, for example, by changing the purchase of goods or the business partner according to the analysis result.
- In step S6, the user measures the effect of the demonstration experiment.
- the user measures the effect by comparing the data before and after the experiment, for example, comparing the sales when the predictive analysis is introduced on a trial basis with the sales before the introduction.
- the user then introduces predictive analytics into the actual business, depending on the results of the proof-of-concept experiments and the measured effects.
- This disclosure focuses on this point and allows the information processing device to perform predictive analysis including extraction of problem settings and construction of data sets.
- FIG. 2 is a diagram showing a configuration example of the information processing system 1 according to the embodiment of the present disclosure.
- the information processing system 1 includes a terminal device 11 and an information processing device 12.
- the terminal device 11 and the information processing device 12 are connected to each other via a predetermined communication network (network N) so as to be communicable by wire or wirelessly.
- the information processing system 1 may include a plurality of terminal devices 11 and a plurality of information processing devices 12.
- the terminal device 11 is an information processing device used by the user.
- the terminal device 11 is used to provide a service related to predictive analysis.
- the terminal device 11 may be any device as long as the processing in the embodiment can be realized.
- the terminal device 11 may be any device as long as it provides a service related to predictive analysis to the user and has a display for displaying information.
- the terminal device 11 is a device such as a notebook PC (Personal Computer), a desktop PC, a tablet terminal, a smartphone, a mobile phone, or a PDA (Personal Digital Assistant).
- the information processing device 12 is used to provide a service related to predictive analysis to a user.
- the information processing device 12 is an information processing device that controls information on the results of problem setting and predictive analysis evaluation based on user data so as to be displayed to the user.
- the information processing device 12 generates an image showing information about the result of the problem setting and the predictive analysis evaluation, and provides the image to the terminal device 11.
- the information processing device 12 controls the display of the terminal device 11. That is, the information processing device 12 is also a server device that provides information to be displayed on the terminal device 11. For example, the information processing device 12 controls the display of the terminal device 11 by transmitting an image including control information to the terminal device 11.
- the control information is described in, for example, a script language such as Javascript (registered trademark), CSS, or the like.
- the information processing device 12 may provide the terminal device 11 with an application for displaying the provided image or the like. Further, the application itself provided from the information processing device 12 to the terminal device 11 may be regarded as control information.
- FIG. 3 is a diagram schematically showing an analysis process according to the embodiment of the present disclosure.
- FIG. 4 is a diagram illustrating an example of a past case according to the embodiment of the present disclosure.
- FIG. 5 is a diagram illustrating an example of user data according to the embodiment of the present disclosure.
- the user data is, for example, data collected by the user.
- the user data includes various data such as customer information and product information.
- the user uses the user data to perform predictive analysis such as next month's sales.
- the information processing device 12 acquires a past case.
- The past case includes the problem setting of a predictive analysis performed in the past, namely what was predicted in the past (the prediction target, hereinafter also referred to as the past target) and which data was used for that prediction (the analysis data set used for the predictive analysis of the past target, hereinafter also referred to as the past data set).
- the past case includes, for example, the past data set 31.
- The past data set 31 includes, for example, "customer ID", "loan amount", "loan type", "years of service", and "credit loss". In FIG. 4, the diagonal lines indicate that "credit loss" is the past target.
- The past case includes the past data set 31 and the past target (here, "credit loss").
- In step S12, the information processing device 12 acquires user data.
- the user data is data generated and collected by the user, and is data used for model generation of predictive analysis and the like.
- The user data 41 shown in FIG. 5 includes, for example, "customer ID", "loan amount", "loan type", "years of service", "annual income", "total account balance", and "credit loss".
- the information processing device 12 extracts a prediction target based on the acquired past case and user data 41.
- The information processing device 12 selects, for example, a past target related to the user from the past cases.
- For example, the information processing device 12 selects a past target by using a recommender system based on information about the user, such as the department to which the user belongs or the predictive analyses the user performed in the past.
- the information processing apparatus 12 selects the “credit loss” of the past data set 31 shown in FIG. 4 from the past case as the past target.
- The information processing device 12 extracts the same item as the selected past target from the user data 41 as the prediction target for which the predictive analysis is performed this time (hereinafter, also referred to as an extraction target).
- Here, the past target selected by the information processing device 12 is "credit loss". Therefore, the information processing device 12 extracts "credit loss" from the user data 41 shown in FIG. 5 as the prediction target.
- The "credit loss" to be extracted is indicated by diagonal lines. The details of the extraction method of the extraction target will be described later with reference to FIG. 7.
- the information processing apparatus 12 constructs a data set (hereinafter, also referred to as a construction data set) used for predictive analysis of the extraction target based on the user data 41.
- The information processing device 12 extracts, for example, items related to the extraction target as a construction data set. For example, the information processing device 12 extracts "customer ID", "loan amount", "loan type", "years of service", and "credit loss" from the user data 41 shown in FIG. 5 to generate a construction data set.
- Here, the information processing device 12 constructs a data set that includes part of the user data 41 shown in FIG. 5, but the present disclosure is not limited to this; a data set including all of the user data 41 may be constructed. The details of the data set construction method will be described later with reference to FIG. 7.
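- The extraction of the prediction target and the construction data set for the loan example can be sketched as follows. The sketch assumes that items are matched by exact item name, which is a simplification chosen for illustration; the function name `extract_problem_setting` is hypothetical, and the actual extraction method is the one described later with reference to FIG. 7.

```python
def extract_problem_setting(past_columns, past_target, user_columns):
    """Return (extraction_target, construction_columns).

    Assumption for this sketch: an item in the user data corresponds to an
    item in the past data set when the two item names are identical.
    """
    if past_target not in user_columns:
        return None, []
    past = set(past_columns)
    # Keep the user-data items that also appear in the past data set,
    # preserving the order in which they appear in the user data.
    construction = [c for c in user_columns if c in past]
    return past_target, construction


# Items of the past data set 31 (FIG. 4) and of the user data 41 (FIG. 5).
past_columns = ["customer ID", "loan amount", "loan type",
                "years of service", "credit loss"]
user_columns = ["customer ID", "loan amount", "loan type", "years of service",
                "annual income", "total account balance", "credit loss"]

target, construction = extract_problem_setting(
    past_columns, "credit loss", user_columns)
print(target)
print(construction)
```

- Under this assumption, "annual income" and "total account balance" are left out of the construction data set because they do not appear in the past data set.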
- In step S15, the information processing device 12 learns the prediction model based on the extraction target and the construction data set.
- the information processing device 12 converts the data of the construction data set excluding the prediction target (extraction target) into a feature vector.
- the information processing apparatus 12 generates a prediction model by solving a classification or regression problem by machine learning based on a feature vector and an extraction target.
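- As a minimal sketch of this step, the following code converts the items other than the prediction target into feature vectors and then solves a binary classification problem. The one-hot encoding, the use of logistic regression trained by gradient descent, the toy rows, and all function names are assumptions made for illustration; the disclosure does not specify a particular encoding or learning algorithm.

```python
import math


def build_encoder(rows, target):
    """Collect numeric items and the categories of string items (target excluded)."""
    numeric, categories = [], {}
    for key, value in rows[0].items():
        if key == target:
            continue
        if isinstance(value, (int, float)):
            numeric.append(key)
        else:
            categories[key] = sorted({row[key] for row in rows})
    return numeric, categories


def to_feature_vector(row, numeric, categories):
    """Numeric items pass through; string items become one-hot indicators."""
    vector = [float(row[k]) for k in numeric]
    for key, values in categories.items():
        vector.extend(1.0 if row[key] == v else 0.0 for v in values)
    return vector


def train_logistic_regression(X, y, epochs=2000, lr=0.1):
    """Fit weights and a bias by batch gradient descent on the log loss."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for x, t in zip(X, y):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            for j, xj in enumerate(x):
                grad_w[j] += (p - t) * xj
            grad_b += p - t
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def predict(x, w, b):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


# Toy construction data set (hypothetical values); "credit loss" is the target.
rows = [
    {"loan amount": 1.0, "loan type": "housing", "credit loss": 0},
    {"loan amount": 2.0, "loan type": "car", "credit loss": 0},
    {"loan amount": 8.0, "loan type": "housing", "credit loss": 1},
    {"loan amount": 9.0, "loan type": "car", "credit loss": 1},
]
numeric, categories = build_encoder(rows, target="credit loss")
X = [to_feature_vector(r, numeric, categories) for r in rows]
y = [r["credit loss"] for r in rows]
w, b = train_logistic_regression(X, y)
print([predict(x, w, b) for x in X])
```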
- In step S16, the information processing device 12 evaluates the accuracy of the predictive analysis by evaluating the generated prediction model.
- the information processing device 12 evaluates the prediction model using the prediction model and the construction data set.
- the evaluation index is selected according to the analysis method, for example, AUC or Accuracy in the case of classification analysis, MAE (Mean Absolute Error) in the case of regression analysis, and the like.
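- The selection of the evaluation index by analysis method can be sketched as follows. The function names, the 0.5 decision threshold used for Accuracy, and the sample values are assumptions made for this illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the correct labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """MAE: average absolute difference between predictions and correct values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def auc(y_true, scores):
    """AUC as the probability that a positive sample outscores a negative one.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def evaluate(analysis_method, y_true, y_pred):
    """Pick the evaluation indices according to the analysis method."""
    if analysis_method == "classification":
        # Assumption: scores of 0.5 or more are treated as the positive class.
        labels = [1 if s >= 0.5 else 0 for s in y_pred]
        return {"AUC": auc(y_true, y_pred),
                "Accuracy": accuracy(y_true, labels)}
    if analysis_method == "regression":
        return {"MAE": mean_absolute_error(y_true, y_pred)}
    raise ValueError(f"unknown analysis method: {analysis_method}")


classification_result = evaluate(
    "classification", [0, 0, 1, 1], [0.1, 0.6, 0.4, 0.9])
regression_result = evaluate("regression", [10, 20], [12, 17])
print(classification_result)  # {'AUC': 0.75, 'Accuracy': 0.5}
print(regression_result)      # {'MAE': 2.5}
```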
- In step S17, the information processing device 12 presents the extraction information including the extraction target and the evaluation result to the user.
- FIG. 6 is a diagram showing an example of an image presented to the user in the terminal device 11 under the control of the information processing device 12.
- the information processing device 12 presents the user with a combination of the problem setting and the evaluation result.
- In this example, the extraction results for the case where the information processing device 12 extracts a plurality of problem settings are displayed.
- the information processing apparatus 12 displays a list of combinations of problem setting and evaluation results as in the image IM1.
- the user can decide whether or not to perform the predictive analysis with the problem setting presented by the information processing apparatus 12 by referring to the evaluation result, for example.
- the content presented to the user by the information processing device 12 is not limited to the problem setting and the evaluation result.
- the information processing apparatus 12 may present at least one of the construction data set, the extraction target, and the evaluation result to the user.
- the information processing apparatus 12 may present reference information when the user selects a problem setting, such as an effect obtained when predictive analysis is performed.
- the information processing device 12 extracts the problem setting, so that the user does not have to perform the problem setting and can perform the predictive analysis more easily. Further, when the information processing apparatus 12 evaluates the accuracy of the predictive analysis, the user can select the predictive analysis to be executed based on the accuracy evaluation, and the predictive analysis with high accuracy can be performed more easily.
- the information processing device 12 includes a communication unit 101, a storage unit 102, and a control unit 103.
- The information processing device 12 may also include an input unit (for example, a keyboard, a mouse, etc.) that receives various operations from the administrator of the information processing device 12, and a display unit (for example, a liquid crystal display, etc.) for displaying various information.
- the communication unit 101 is realized by, for example, a NIC (Network Interface Card) or the like. Then, the communication unit 101 is connected to the network N by wire or wirelessly, and transmits / receives information to / from another information processing device such as a terminal device 11 or an external server.
- the storage unit 102 is realized by, for example, a semiconductor memory element such as a RAM (Random Access Memory) or a flash memory (Flash Memory), or a storage device such as a hard disk or an optical disk. As shown in FIG. 7, the storage unit 102 includes a past case storage unit 121, a user data storage unit 122, and a user profile storage unit 123. Although not shown, the storage unit 102 may store various information such as an image that is the basis of the image provided to the terminal device 11.
- the past case storage unit 121 stores past cases.
- Past cases include information about predictive analytics performed in the past.
- the past case storage unit 121 stores, for example, a case when predictive analysis is introduced into the business in the past.
- the past cases may be appropriately acquired from an external server or the like without being held by the information processing device 12.
- FIG. 8 is a diagram showing an example of the past case storage unit 121.
- The past case storage unit 121 stores information such as "problem setting", "data set", "collection cost", "prediction model", "model evaluation result", "demonstration experiment", and "business effect" for each case.
- the past case storage unit 121 stores a plurality of past cases, such as past cases A, B, and so on.
- "Problem setting" is information indicating what data was used and what was predicted in the predictive analysis.
- The "problem setting" includes, for example, a plurality of "use items" (explanatory variables) indicating what data was used, and one "prediction target" (objective variable) indicating what was predicted.
- The items shown by diagonal lines are the prediction targets, and the remaining items are the use items.
- the "data set” is a past data set used for training the prediction model.
- a “data set” is a data set including "input data” and "correct answer data”.
- the past data set 31 shown in FIG. 4 corresponds to such a “data set”.
- Collection cost is the cost of collecting the data used in the predictive analytics.
- the “collection cost” includes, for example, the period and cost required for collecting data for each item.
- the "prediction model” is a past prediction model (hereinafter, also referred to as a past model) generated by using the "problem setting" and "data set” to be stored.
- a “predictive model” is a model generated by solving a classification or regression problem, for example, by machine learning.
- the “model evaluation result” is the result of the accuracy evaluation of the "prediction model” to be stored.
- the “model evaluation result” includes the evaluation result by an evaluation index such as AUC or Accuracy.
- “Demonstration experiment” is information on the contents and results of the demonstration experiment conducted for the business introduction of predictive analysis.
- the “demonstration experiment” includes, for example, information such as the period and range of the experiment, the data used in the experiment, the effect obtained by the experiment, and the cost of the experiment.
- Business effect is information on the business effect obtained after introducing predictive analysis into the business.
- the "business effect” includes, for example, information such as a profit amount such as an improved sales amount and a cost reduction amount such as a reduced labor cost.
- the past case storage unit 121 stores various information when the predictive analysis is introduced into the business in the past for each of a plurality of past cases.
- The above-mentioned past case is an example. As long as the past case storage unit 121 stores the "problem setting" and the "data set", some of the information, such as the "collection cost", the "model evaluation result", and the "demonstration experiment", may not be stored, and information other than the above-mentioned information may be stored.
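- One way to picture a record of the past case storage unit 121 is the following sketch. The class and field names are hypothetical; only the "problem setting" and the "data set" are required, mirroring the statement above that the other pieces of information may be absent.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class ProblemSetting:
    use_items: List[str]       # explanatory variables: what data was used
    prediction_target: str     # objective variable: what was predicted


@dataclass
class PastCase:
    # Required information.
    problem_setting: ProblemSetting
    data_set: List[Dict[str, Any]]
    # Information that may not be stored for every case (assumed optional here).
    collection_cost: Optional[Dict[str, Any]] = None
    prediction_model: Optional[Any] = None
    model_evaluation_result: Optional[Dict[str, float]] = None
    demonstration_experiment: Optional[Dict[str, Any]] = None
    business_effect: Optional[Dict[str, Any]] = None


# Past case modeled on the past data set 31 of FIG. 4 (values are made up).
case_a = PastCase(
    problem_setting=ProblemSetting(
        use_items=["customer ID", "loan amount", "loan type", "years of service"],
        prediction_target="credit loss",
    ),
    data_set=[{"customer ID": 1, "loan amount": 100, "loan type": "housing",
               "years of service": 5, "credit loss": 0}],
    model_evaluation_result={"AUC": 0.8},
)
print(case_a.problem_setting.prediction_target)
```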
- Returning to FIG. 7, the user data storage unit 122 will be described.
- User data is various data created or collected by the user.
- A wide variety of data formats are assumed for the user data.
- the user data may be appropriately acquired from the terminal device 11, an external server, or the like without being held by the information processing device 12. Further, the user data may be raw data directly acquired from a camera, a sensor, or the like, or may be processed data obtained by performing processing such as feature extraction on the raw data. Alternatively, the user data may include metadata that is a recognition result obtained by performing recognition processing of raw data or processed data.
- the user profile storage unit 123 stores profile information about the user.
- the profile information includes, for example, user information and user case information.
- the user information is information about the user, and includes, for example, information about the user ID, the company name to which the user belongs, the department, the industry, and the like.
- The user information may include information related to the user's interests, such as search history of websites and databases, browsing history of websites, and keywords contained in e-mails and office documents.
- the user case information includes information related to the past predictive analysis performed by the user.
- the user case information includes, for example, information on predictive analysis performed by the user in the past, information on past cases in which the user has been involved, and the like. It should be noted that such predictive analysis may be performed by the user himself or herself, or may be performed by the department or company to which the user belongs.
- The control unit 103 is realized by, for example, a CPU (Central Processing Unit), an MPU (Micro Processing Unit), or the like executing a program stored inside the information processing device 12 (for example, a program according to the present disclosure) using a RAM or the like as a work area. Further, the control unit 103 is a controller, and is realized by, for example, an integrated circuit such as an ASIC (Application Specific Integrated Circuit) or an FPGA (Field Programmable Gate Array).
- the control unit 103 includes an acquisition unit 141, an information processing unit 142, and a display control unit 143.
- The information processing unit 142 includes a time prediction unit 151, an interpretation unit 152, an extraction unit 153, a time series processing unit 154, a leak detection correction unit 155, a learning unit 156, an evaluation unit 157, a prediction unit 158, a collection decision unit 159, and a contribution calculation unit 160.
- the internal structure of the control unit 103 is not limited to the configuration shown in FIG. 7, and may be another configuration as long as it is a configuration for performing information processing described later. Further, the connection relationship of each processing unit included in the control unit 103 is not limited to the connection relationship shown in FIG. 7, and may be another connection relationship.
- the acquisition unit 141 acquires various information from the storage unit 102. For example, the acquisition unit 141 acquires a plurality of past cases from the past case storage unit 121. For example, the acquisition unit 141 acquires user data from the user data storage unit 122. For example, the acquisition unit 141 acquires profile information from the user profile storage unit 123.
- the acquisition unit 141 may acquire various information from an external server, a terminal device 11, or the like instead of the past case storage unit 121, the user data storage unit 122, and the user profile storage unit 123.
- the time prediction unit 151 predicts the time required for the analysis process performed by the control unit 103 from the start of data acquisition by the acquisition unit 141 to the presentation of the processing result such as problem setting extraction to the user.
- The time prediction unit 151 performs analysis processing such as extraction, learning, and evaluation of problem settings using the part of the user data that the acquisition unit 141 acquires in a predetermined time (for example, 1 second) (hereinafter, also referred to as partial data).
- the analysis process is a process performed by each unit of the control unit 103 from the start of data acquisition by the acquisition unit 141 to the presentation of the processing result to the user, and the details will be described later.
- the time prediction unit 151 measures the processing time of the analysis process performed using some data.
- The analysis process may take several hours or more, and in some cases several days, depending on the type and size of the user data, so users want to know how long it will take. The time prediction unit 151 therefore calculates a predicted processing time using partial data, which makes it possible to present the user with an estimate of the time required for the analysis process. At this time, by limiting the size of the data used for calculating the predicted processing time to a size that can be acquired in, for example, 1 second, the time required for calculating the predicted processing time can be kept short.
- The time prediction unit 151 does not simply calculate the predicted processing time from the size of the user data, but actually executes the analysis process using partial data to calculate it.
- The size of the user data can be easily obtained, but the time required for predictive analysis depends not only on the size of the user data but also on the nature of the data. By actually executing the process to calculate the predicted processing time, the time prediction unit 151 can improve the accuracy of that prediction.
- Here, the time prediction unit 151 calculates the predicted processing time using partial data acquired in a predetermined time, but the calculation is not limited to this.
- The time prediction unit 151 may calculate the predicted processing time using partial data of a predetermined size (for example, 100 rows to 2000 rows).
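- The idea of timing the real analysis on partial data can be sketched as follows. The disclosure does not state how the measured time is scaled up to the full data, so the linear extrapolation by row count used here is an assumption, as are the function name `estimate_processing_time` and its parameters.

```python
import time


def estimate_processing_time(analyze, rows, sample_rows=1000,
                             clock=time.perf_counter):
    """Run `analyze` on the first `sample_rows` rows and extrapolate.

    Assumption for this sketch: processing time grows linearly with the
    number of rows, so the measured time is scaled by len(rows) / len(sample).
    `clock` is injectable so that the function can be tested deterministically.
    """
    sample = rows[:sample_rows]
    if not sample:
        return 0.0
    start = clock()
    analyze(sample)  # actually execute the analysis on the partial data
    elapsed = clock() - start
    return elapsed * len(rows) / len(sample)


# Stand-in analysis step (hypothetical): sorting plays the role of the
# extraction, learning, and evaluation performed on the partial data.
rows = list(range(10_000, 0, -1))
estimate = estimate_processing_time(sorted, rows, sample_rows=1000)
print(f"predicted processing time: {estimate:.6f} s")
```

- Because the analysis itself is executed, the estimate reflects properties of the data that the row count alone would not capture, at the cost of a short measurement run.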
- the time prediction unit 151 may predict the prediction processing time using a trained processing time prediction model prepared in advance.
- The time prediction unit 151 extracts, for example, information such as the number of items (number of columns), the missing rate of each item, the data type of each item (character string / numerical value / date, etc.), and the type of machine learning (binary classification / multi-class classification / regression, etc.) from the partial data. The time prediction unit 151 inputs the extracted information to the trained processing time prediction model to predict the processing time.
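- The information extracted from the partial data can be sketched as follows. The function names, the dictionary layout, the type-inference rules, and the sample rows are assumptions made for illustration; the trained processing time prediction model that would consume this information is not shown.

```python
from datetime import date


def infer_type(values):
    """Classify an item as 'date', 'numerical', or 'string' from its values."""
    present = [v for v in values if v is not None]
    # datetime.datetime is a subclass of date, so it is covered as well.
    if present and all(isinstance(v, date) for v in present):
        return "date"
    if present and all(isinstance(v, (int, float)) and not isinstance(v, bool)
                       for v in present):
        return "numerical"
    return "string"


def extract_meta_features(rows, task_type):
    """Summarize partial data for a processing time prediction model."""
    columns = list(rows[0].keys())
    features = {"n_columns": len(columns), "task_type": task_type, "columns": {}}
    for col in columns:
        values = [row.get(col) for row in rows]
        missing = sum(v is None for v in values)
        features["columns"][col] = {
            "missing_rate": missing / len(rows),
            "data_type": infer_type(values),
        }
    return features


# Hypothetical partial data; None marks a missing value.
partial_data = [
    {"loan amount": 100, "loan type": "housing", "contract date": date(2020, 1, 6)},
    {"loan amount": None, "loan type": "car", "contract date": date(2020, 2, 3)},
    {"loan amount": 300, "loan type": "car", "contract date": None},
    {"loan amount": 250, "loan type": None, "contract date": None},
]
meta = extract_meta_features(partial_data, task_type="binary classification")
print(meta)
```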
- the time prediction unit 151 may update the prediction processing time at a predetermined timing such as the passage of a certain time or the timing when the processing of each unit is completed.
- the time prediction unit 151 uses some data to execute a process that has not yet been completed at a predetermined timing.
- the time prediction unit 151 updates the prediction processing time by recalculating the prediction processing time based on the time required for the executed processing.
- The partial data used for updating the predicted processing time may be the same as the partial data used for calculating the predicted processing time before the update, or may be user data acquired again at the time of the update.
- For example, when the interpretation unit 152, which will be described later, performs structuring processing on all the user data, user data of a predetermined size may be acquired from the structured user data and used as the partial data.
- the interpretation unit 152 analyzes and structures the user data acquired from the user data storage unit 122 by the acquisition unit 141. First, the data analysis performed by the interpretation unit 152 will be described.
- user data includes various data formats.
- the interpretation unit 152 analyzes user data using, for example, a recognizer (not shown) for each type of data. It is assumed that the recognizer is stored in the storage unit 102, for example.
- The interpretation unit 152 performs, for example, recognition processing that detects faces, character strings, general objects, and the like from image data included in the user data using an image recognizer.
- The interpretation unit 152 detects a user ID (terminal ID), a shooting location, a shooting time, and the like from the data attached to the image.
- The interpretation unit 152 detects character strings in the image and recognizes the telephone number, company name, purchased product, product price, total amount, payment method (cash / credit / electronic money / QR code (registered trademark) payment, etc.), and so on.
- the interpretation unit 152 adds the recognition result as metadata to the user data which is the raw data.
- The interpretation unit 152 recognizes the speaker in voice data included in the user data using a voice recognizer, and converts the utterance content into text. Alternatively, the interpretation unit 152 recognizes the user's movement behavior (walking / bicycle / train, etc.) at each time from acceleration data. Further, for text data, the interpretation unit 152 corrects notation variations and adds similar expressions using a synonym dictionary. In this way, the interpretation unit 152 analyzes the user data for each type of data and adds metadata.
- The interpretation unit 152 may recognize one piece of data using a plurality of recognizers.
- For example, the interpretation unit 152 first converts the voice data into text data, and then translates the converted text data into multiple languages. Subsequently, the interpretation unit 152 corrects the notation variations of the translated text data and adds similar expressions. In this way, the interpretation unit 152 may recognize the user data by using recognizers in multiple stages.
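- The multi-stage use of recognizers can be sketched as a simple pipeline in which each stage reads the metadata produced so far and adds its own. Every recognizer below is a hypothetical stand-in backed by a lookup table, not a real speech recognition or translation API, and the clip ID, dictionaries, and function names are assumptions made for illustration.

```python
# Hypothetical lookup tables standing in for real recognizers.
FAKE_TRANSCRIPTS = {"clip-001": "PCを買った"}
FAKE_TRANSLATIONS = {"PCを買った": {"en": "I bought a PC"}}
SYNONYMS = {"pc": "personal computer"}  # hypothetical synonym dictionary


def transcribe(raw, metadata):
    """Stage 1: convert voice data into text (stubbed)."""
    return {"text": FAKE_TRANSCRIPTS[raw["clip_id"]]}


def translate(raw, metadata):
    """Stage 2: translate the text produced by stage 1 (stubbed)."""
    return {"translations": FAKE_TRANSLATIONS[metadata["text"]]}


def normalize(raw, metadata):
    """Stage 3: unify notation of the translated text via the synonym dictionary."""
    normalized = {}
    for lang, text in metadata["translations"].items():
        words = [SYNONYMS.get(w.lower(), w.lower()) for w in text.split()]
        normalized[lang] = " ".join(words)
    return {"normalized": normalized}


def run_recognizers(raw, recognizers):
    """Apply recognizers in order; each one sees the metadata added so far."""
    metadata = {}
    for recognize in recognizers:
        metadata.update(recognize(raw, metadata))
    return {"raw": raw, "metadata": metadata}


result = run_recognizers({"clip_id": "clip-001"},
                         [transcribe, translate, normalize])
print(result["metadata"]["normalized"]["en"])  # i bought a personal computer
```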
- the interpretation unit 152 may recognize the user data based on various known techniques.
- the interpretation unit 152 structures the user data based on the analysis result.
- the interpretation unit 152 structures the metadata added to the user data by using the template.
- the template is specialized for predictive analysis, and it is assumed that, for example, a plurality of templates are stored in advance by the storage unit 102.
- the interpretation unit 152 structures the data by applying the data to the most suitable template.
- the interpretation unit 152 structures the metadata which is unstructured data.
- the interpretation unit 152 may newly add metadata.
- the metadata given here is used when extracting the problem setting.
- the interpretation unit 152 may add higher categories such as "food expenses” and “living miscellaneous expenses” from the "product name" assigned to the receipt image as metadata.
- the interpretation unit 152 may structure the user data based on various known techniques. Further, the above-mentioned template and higher category are examples, and the interpretation unit 152 may structure user data by using various templates, categories, and metadata specialized for predictive analysis. Further, when the user data stored in the user data storage unit 122 is already structured, the processing of the interpretation unit 152 may be omitted.
- the interpretation unit 152 analyzes and structures the user data, so that the burden on the user can be reduced.
- the extraction unit 153 extracts the problem setting in the predictive analysis based on the user data structured by the interpretation unit 152 (hereinafter, also referred to as structured data) and the past case acquired by the acquisition unit 141.
- the problem setting includes a plurality of "use items" (explanatory variables), which indicate what data items are used, and one "prediction target" (objective variable), which indicates what is to be predicted.
- Extraction unit 153 extracts "prediction target” from structured data based on past cases.
- For example, the extraction unit 153 extracts, from the structured data, the same items (variables) as the past objects included in the past cases as "prediction targets".
- the extraction unit 153 extracts a "prediction target" that is related to the user or is considered to be of high interest to the user, for example, based on the profile information. For example, if a user is in the business of selling merchandise, they may be more interested in "sales” forecasts. Therefore, in this case, the extraction unit 153 extracts "sales” as a prediction target.
- the extraction unit 153 extracts candidates from past objects of past cases using a recommendation system, for example, based on profile information.
- the extraction unit 153 sets the items included in the user data from the extracted candidates as the "prediction target" of the problem setting.
- Recommender systems include, for example, ranking learning, content-based filtering, collaborative filtering, or a combination of these.
- the extraction unit 153 may extract a plurality of "prediction targets". For example, when a plurality of past objects are extracted in a ranking format as in ranking learning, the extraction unit 153 extracts a predetermined number of "prediction targets” from the top ranking. In this way, the extraction unit 153 extracts a plurality of "prediction targets", so that the extraction unit 153 can extract a wide range of "prediction targets” related to the user.
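- As a minimal sketch of the candidate filtering described above, the following assumes that a recommender has already produced a ranked list of past prediction targets, and keeps the top-ranked entries that also exist as items in the user data. The function name `extract_prediction_targets`, the `top_n` parameter, and the sample item names are illustrative assumptions and do not appear in the disclosure.

```python
def extract_prediction_targets(ranked_past_targets, user_items, top_n=2):
    """Keep the highest-ranked past targets that also exist in the user data."""
    available = set(user_items)
    matches = [t for t in ranked_past_targets if t in available]
    return matches[:top_n]


# Hypothetical ranking (best first) and hypothetical structured-data items.
ranked = ["sales", "number of visitors", "number of sales", "churn"]
items = ["date", "sales area", "number of sales", "sales"]
print(extract_prediction_targets(ranked, items))
```

- Candidates that the user data cannot support (here, "number of visitors" and "churn") are dropped before the top-ranked entries are taken.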
- the extraction unit 153 extracts a plurality of "use items” for each extracted “prediction target” (extraction target).
- the extraction unit 153 sets items (variables) related to the extraction target from the structured data to "use items" (explanatory variables).
- the extraction unit 153 may set an item related to the extraction target as a "use item”. In this case, the information processing device 12 can improve the learning accuracy in the prediction model learning, which is the processing after extraction.
- the extraction unit 153 may set a predetermined number of items as "use items” in order from the one having the highest relevance to the extraction target. In this case, the information processing device 12 can reduce the processing load in the prediction model learning.
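- The selection of a predetermined number of "use items" in order of relevance can be sketched as follows. The disclosure does not state how relevance is measured; the absolute Pearson correlation with the prediction target is used here purely as an assumption, and the names `pearson` and `select_use_items` are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either column is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)


def select_use_items(columns, target, k):
    """Return the k items most relevant to the target (relevance = |correlation|, an assumption)."""
    scores = {
        name: abs(pearson(values, columns[target]))
        for name, values in columns.items()
        if name != target
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical numeric columns of a structured data set.
columns = {
    "number of sales": [10, 12, 15, 20, 22],
    "advertising cost": [1, 2, 3, 5, 6],
    "temperature": [20, 18, 25, 19, 21],
    "store id": [7, 7, 7, 7, 7],
}
print(select_use_items(columns, "number of sales", 2))
```

- Limiting `k` corresponds to the case in which the processing load of prediction model learning is reduced.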
- the extraction unit 153 constructs a data set based on the extracted "prediction target" and "use item” (hereinafter, also referred to as extraction item).
- the extraction unit 153 constructs a data set by extracting data corresponding to the prediction target and the extraction item from the structured data.
- the extraction unit 153 may, for example, extract a plurality of problem settings.
- the extraction unit 153 extracts a plurality of combinations of the "prediction target" and the plurality of "use items" corresponding to the "prediction target".
- the extraction unit 153 constructs a data set according to the extracted problem setting. Therefore, when a plurality of problem settings are extracted, the extraction unit 153 constructs a plurality of data sets corresponding to each problem setting. By constructing the data set by the extraction unit 153 in this way, even if there are a plurality of problem settings, the user does not need to construct each corresponding data set, and the burden on the user can be reduced.
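- The construction of one data set per problem setting can be illustrated as below. This is a sketch under the assumption that structured data is a list of dictionaries and that a problem setting is a pair of a prediction target and its use items; `build_dataset` and `build_datasets` are hypothetical names.

```python
def build_dataset(structured_records, prediction_target, use_items):
    """Extract the columns of one problem setting; records missing a column are skipped (an assumption)."""
    columns = list(use_items) + [prediction_target]
    return [
        {c: rec[c] for c in columns}
        for rec in structured_records
        if all(c in rec for c in columns)
    ]


def build_datasets(structured_records, problem_settings):
    """Build one data set for each (prediction_target, use_items) pair."""
    return [
        build_dataset(structured_records, target, items)
        for target, items in problem_settings
    ]


# Hypothetical structured data and two hypothetical problem settings.
records = [
    {"date": "2020-01-01", "sales area": "region A", "number of sales": 10, "sales": 1000},
    {"date": "2020-01-02", "sales area": "region B", "number of sales": 12},
]
settings = [("number of sales", ["date", "sales area"]), ("sales", ["date"])]
datasets = build_datasets(records, settings)
```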
- When the construction data set constructed by the extraction unit 153 is a data set of time-series data (hereinafter referred to as a time-series data set), the time-series processing unit 154 resamples the date and time of the time-series data set. For example, the time-series processing unit 154 corrects the sampling interval (time interval) of the time-series data set and interpolates missing values of the prediction target.
- the time series processing unit 154 separates the time series data set for each series, and resamples the date and time of the time series data set for each separated series.
- the time-series processing unit 154 detects a change point in which the tendency of the value to be predicted changes significantly, and divides the construction data set based on the change point.
- A detailed configuration example of the time series processing unit 154 will be described later with reference to FIG.
- the leak detection correction unit 155 detects the possibility of leakage of the construction data set based on the nature of the construction data set and the past cases stored in the past case storage unit 121. In addition, the leak detection correction unit 155, if necessary, corrects the construction data set based on the possibility of the detected leak.
- A detailed configuration example of the leak detection correction unit 155 will be described later with reference to FIG.
- the learning unit 156 learns the prediction model based on the problem setting extracted by the extraction unit 153 and either the construction data set constructed by the extraction unit 153 or the construction data set corrected by the leak detection correction unit 155.
- the learning unit 156 learns a prediction model corresponding to each of the plurality of problem settings.
- the learning unit 156 divides the construction data set into learning data and test data.
- the learning unit 156 converts the learning data into a feature vector.
- the learning unit 156 generates a prediction model by solving a classification or regression problem by machine learning based on a feature vector and a prediction target.
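- The division into learning data and test data and the conversion into feature vectors can be sketched as follows. The split ratio, the order-preserving split, and one-hot encoding of string values are assumptions chosen for illustration; the disclosure does not specify them, and `split_dataset` and `to_feature_vectors` are hypothetical names.

```python
def split_dataset(records, test_ratio=0.25):
    """Split records into learning data and test data, keeping the original order (an assumption)."""
    n_test = max(1, int(len(records) * test_ratio))
    return records[:-n_test], records[-n_test:]


def to_feature_vectors(records, use_items):
    """One-hot encode string-valued items and pass numeric items through as floats."""
    categories = {
        item: sorted({r[item] for r in records if isinstance(r[item], str)})
        for item in use_items
    }
    vectors = []
    for r in records:
        vec = []
        for item in use_items:
            if categories[item]:
                vec.extend(1.0 if r[item] == c else 0.0 for c in categories[item])
            else:
                vec.append(float(r[item]))
        vectors.append(vec)
    return vectors


# Hypothetical construction data set.
records = [
    {"sales area": "region A", "advertising cost": 3, "number of sales": 10},
    {"sales area": "region B", "advertising cost": 5, "number of sales": 12},
    {"sales area": "region A", "advertising cost": 4, "number of sales": 11},
    {"sales area": "region B", "advertising cost": 6, "number of sales": 15},
]
learning, test = split_dataset(records)
features = to_feature_vectors(learning, ["sales area", "advertising cost"])
```

- In this sketch the category list is derived from the records passed in; in practice the categories found in the learning data would also have to be reused when encoding the test data.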
- the machine learning described above is an example, and the learning unit 156 may learn a prediction model based on various known techniques.
- the learning unit 156 divides the construction data set, but this is an example.
- the extraction unit 153 may construct each of the training data set and the test data set.
- the evaluation unit 157 evaluates the prediction model generated by the learning unit 156. When the learning unit 156 generates a plurality of prediction models, the evaluation unit 157 evaluates each of the plurality of prediction models.
- the evaluation unit 157 evaluates the prediction model using the evaluation index based on the prediction model and the test data.
- the evaluation index is, for example, AUC for binary classification, Accuracy for multi-value classification, MAE for regression, and the like.
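- For reference, the three evaluation indexes named above can be computed as in the following sketch. The function names are hypothetical, and AUC is computed here by direct pairwise comparison of positive and negative examples, which is simple but only practical for small test sets.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true class (multi-value classification)."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error (regression)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def auc(y_true, scores):
    """Probability that a positive example is scored above a negative one; ties count as 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```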
- the above-mentioned evaluation index is an example, and the evaluation unit 157 may evaluate the prediction model based on various known techniques. For example, the user may specify an evaluation index.
- the prediction unit 158 predicts the business effect when the prediction model is introduced into the business.
- the prediction unit 158 predicts the business effect (hereinafter, also referred to as the prediction effect) when the plurality of prediction models are introduced into the business.
- the prediction unit 158 selects a past case whose past target is the same item as the extraction target extracted by the extraction unit 153 from the past case storage unit 121.
- the prediction unit 158 performs predictive analysis using the "business effect" included in the selected past case as a new "prediction target” (hereinafter, also referred to as an effect prediction target).
- the prediction unit 158 first sets the "business effect” to the "effect prediction target". Next, the prediction unit 158 sets the item related to the "business effect” to the "use item” from the past case.
- the prediction unit 158 may set the "use item” from the items included in both the past case and the structured user data (or the construction data set), for example.
- the prediction unit 158 constructs a data set (hereinafter, also referred to as an effect learning data set) by extracting data corresponding to "use items" from past cases.
- the prediction unit 158 generates a prediction model (hereinafter, also referred to as an effect prediction model) by solving, for example, a regression problem by machine learning based on the effect learning data set and the "effect prediction target".
- the prediction unit 158 extracts the data corresponding to the "use item" from the structured user data and constructs a data set (hereinafter, also referred to as an effect prediction data set).
- the prediction unit 158 predicts the business effect when the prediction model generated by the learning unit 156 is introduced into the business based on the effect prediction data set and the generated effect prediction model.
- the prediction unit 158 may predict the business effect based on various known techniques. Further, the construction of the effect prediction data set and the learning of the effect prediction model performed by the prediction unit 158 may be executed by using some functions of the extraction unit 153 and the learning unit 156.
- the collection decision unit 159 determines a data item (hereinafter, also referred to as a proposal item) for which collection is proposed to the user based on past cases and user data for each extracted problem setting. When there are a plurality of problem settings, the collection decision unit 159 determines a proposal item for each of the plurality of problem settings. The collection determination unit 159 may determine a plurality of proposal items for one problem setting.
- the collection decision unit 159 compares the data set of the past case (past data set) with the data set constructed by the extraction unit 153 or the data set corrected by the leak detection correction unit 155 (construction data set).
- the collection decision unit 159 extracts "use items" (hereinafter, also referred to as "uncollected items") that are included in the past data set but not included in the construction data set.
- the collection decision unit 159 predicts the business effect when "uncollected items" are not used in the past cases. Specifically, the collection decision unit 159 learns the prediction model using the past data set excluding the “uncollected items” and evaluates the accuracy of the prediction model. The collection decision unit 159 recalculates the business effect with the evaluated prediction accuracy. Since the learning, evaluation, and calculation of the business effect of the prediction model here are the same as the processing of the learning unit 156, the evaluation unit 157, and the prediction unit 158, the description thereof will be omitted.
- the collection decision unit 159 determines "uncollected items" whose effects have decreased as proposal items based on the calculated business effect.
- When the collection decision unit 159 extracts a plurality of "uncollected items", the collection decision unit 159 recalculates the business effect for each "uncollected item". Then, the collection decision unit 159 determines the "uncollected item" having the largest decrease in the business effect as the proposal item. Alternatively, the collection decision unit 159 may determine "uncollected items" whose amount of decrease in business effect is equal to or greater than a threshold value as the proposal items, or may determine a predetermined number of "uncollected items" as the proposal items.
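- The three selection rules above (largest decrease, decrease at or above a threshold, and a predetermined number of items) can be sketched as follows, assuming the business effect has already been recalculated with each "uncollected item" left out. The function name, its parameters, and the sample values are hypothetical.

```python
def decide_proposal_items(baseline_effect, effect_without, threshold=None, top_n=None):
    """Rank uncollected items by how much the business effect drops when each one is left out."""
    decreases = {item: baseline_effect - e for item, e in effect_without.items()}
    # Only items whose removal lowers the business effect are candidates.
    ranked = sorted(
        (item for item in decreases if decreases[item] > 0),
        key=decreases.get,
        reverse=True,
    )
    if threshold is not None:
        return [item for item in ranked if decreases[item] >= threshold]
    if top_n is not None:
        return ranked[:top_n]
    return ranked[:1]  # default: the single item with the largest decrease


# Hypothetical business effects recalculated without each uncollected item.
effect_without = {"weather": 70.0, "event": 90.0, "holiday": 100.0}
print(decide_proposal_items(100.0, effect_without))
```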
- the collection decision unit 159 may determine the proposal item based on the newly calculated business effect and collection cost. In this case, the collection decision unit 159 calculates the difference between the introduction effect, obtained by subtracting the collection cost from the business effect calculated by the prediction unit 158 including the "uncollected items", and the business effect calculated without including the "uncollected items". The collection decision unit 159 determines the "uncollected item" having a large calculated difference as the proposal item.
- the collection decision unit 159 determines the proposal items taking the "collection cost" of the data into account, so that the information processing apparatus 12 can preferentially propose to the user uncollected items whose collection cost is low and whose data can be easily collected.
- the information processing apparatus 12 can propose to the user data collection of uncollected items, which has a high collection cost but has a large business effect when used.
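- The cost-aware difference described above can be written as (business effect with the item - collection cost) - business effect without the item. The following sketch ranks candidates by that difference; the function names and all numeric values are hypothetical.

```python
def net_gain(effect_with, effect_without, collection_cost):
    """Introduction effect (effect minus collection cost) minus the effect without the item."""
    introduction_effect = effect_with - collection_cost
    return introduction_effect - effect_without


def rank_by_net_gain(candidates):
    """candidates: {item: (effect_with, effect_without, collection_cost)}; largest gain first."""
    gains = {item: net_gain(*values) for item, values in candidates.items()}
    return sorted(gains.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical values in the same unit as the business effect.
candidates = {
    "weather": (100.0, 70.0, 5.0),
    "event": (100.0, 90.0, 20.0),
    "holiday": (100.0, 95.0, 1.0),
}
print(rank_by_net_gain(candidates))
```

- An item with a negative difference, such as "event" in this example, costs more to collect than it is expected to return.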
- the collection decision unit 159 learns the prediction model when the "uncollected items" are not used, evaluates the accuracy, and calculates the business effect, but the present invention is not limited to this.
- the learning unit 156, the evaluation unit 157, and the prediction unit 158 may perform learning of the prediction model, accuracy evaluation, and calculation of the business effect, respectively.
- the collection decision unit 159 determines the proposal item based on the result of each unit.
- the collection decision unit 159 decides the proposal items based on the business effect, but the present invention is not limited to this.
- the collection decision unit 159 may decide the proposal item based on, for example, the evaluation result of the prediction model. In this case, the collection decision unit 159 evaluates the accuracy of the prediction model learned without using the "uncollected items", and determines the "uncollected items" with a large decrease in evaluation as the proposal items.
- the contribution calculation unit 160 calculates the contribution indicating which feature amount contributes to the prediction result among the feature amounts of the test data input to the prediction model learned by the learning unit 156. Specifically, the contribution calculation unit 160 removes the feature amount for which the contribution is to be calculated from the input of the prediction model, and calculates the contribution based on the change in the prediction result before and after the removal.
- the contribution degree calculated by the contribution degree calculation unit 160 includes a positive value and a negative value.
- a positive value of contribution means that the set of features contributes positively to the prediction, that is, it improves the prediction probability predicted by the prediction model. Further, when the contribution is a negative value, it means that the set of features contributes negatively to the prediction, that is, the prediction probability predicted by the prediction model is lowered.
- the contribution calculation unit 160 calculates the ratio of the feature amount for which the contribution degree has been calculated to the set (item) of the feature amount. If the calculated ratio is low, it rarely occurs even if the contribution is high, and the utility value for the user is low. Therefore, in the embodiment of the present disclosure, the contribution calculation unit 160 calculates the ratio of the feature amount for which the contribution is calculated, and presents such a ratio to the user. As a result, the user can confirm the contribution of the data in consideration of the degree of occurrence.
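- The removal-based contribution and the occurrence ratio can be sketched as follows. The logistic scoring function is only a toy stand-in for the learned prediction model, and the feature naming ("item=value"), the weights, and the function names are assumptions made for illustration.

```python
import math


def predict(features, weights, bias=0.0):
    """Toy stand-in for a learned model: logistic score over the active features."""
    z = bias + sum(weights.get(f, 0.0) for f in features)
    return 1.0 / (1.0 + math.exp(-z))


def contribution(features, target, weights, bias=0.0):
    """Change in the prediction when `target` is removed from the model input."""
    without = [f for f in features if f != target]
    return predict(features, weights, bias) - predict(without, weights, bias)


def occurrence_ratio(records, item, value):
    """Share of records in which `item` takes `value`."""
    return sum(1 for r in records if r.get(item) == value) / len(records)


# Hypothetical weights and test record.
weights = {"payment=credit": 1.2, "area=region B": -0.8}
features = ["payment=credit", "area=region B"]
records = [{"payment": "credit"}, {"payment": "cash"}, {"payment": "cash"}, {"payment": "cash"}]
print(contribution(features, "payment=credit", weights))
print(occurrence_ratio(records, "payment", "credit"))
```

- A positive contribution paired with a low occurrence ratio indicates a feature that raises the prediction but seldom appears in the data.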
- the prediction unit 158, the contribution calculation unit 160, and the collection decision unit 159 calculate the business effect and the contribution, respectively, and determine the proposal items, but it is not necessary to calculate / determine all of them.
- the contribution calculation unit 160 may calculate the contribution, and the calculation of the business effect by the prediction unit 158 and the determination of the proposed item by the collection determination unit 159 may be omitted.
- the contribution calculation unit 160 may calculate the contribution and the prediction unit 158 may calculate the business effect, and the collection decision unit 159 may omit the determination of the proposed item.
- the user may be able to select the process of calculating / determining.
- the display control unit 143 controls the display of various information.
- the display control unit 143 controls the display of various information in the terminal device 11.
- the display control unit 143 generates an image including control information for controlling the display mode. This control information is described by, for example, a script language such as JavaScript (registered trademark) or CSS.
- the display control unit 143 provides the terminal device 11 with an image including the above control information, so that the terminal device 11 performs the above-mentioned display process according to the control information.
- the display control unit 143 is not limited to the above, and may control the display of the terminal device 11 by appropriately using various conventional techniques.
- FIG. 9 shows a configuration example of the time series processing unit 154 of FIG.
- the time series processing unit 154 includes a separation unit 201, a resampling unit 202, and a change point detection unit 203.
- the separation unit 201 separates the time series data sets for each series. For example, the separation unit 201 separates the construction data set into a plurality of series of time series data sets based on the degree of duplication of the values of the items related to the date and time (hereinafter referred to as date and time items) in the construction data set.
- When the construction data set is a time series data set, the resampling unit 202 resamples the date and time of the time series data set. Further, when the construction data set includes a plurality of time series data sets, the resampling unit 202 resamples the date and time of the time series data set for each series.
- the change point detection unit 203 detects the change point at which the tendency of the value to be predicted changes significantly. Further, the change point detection unit 203 divides the construction data set based on the detected change points.
- FIG. 10 shows a configuration example of the leak detection correction unit 155 of FIG.
- the leak detection correction unit 155 includes an occurrence date / time identification unit 221, an acquisition date / time identification unit 222, a prediction execution timing setting unit 223, a leak detection unit 224, and a leak correction unit 225.
- the occurrence date / time specifying unit 221 specifies the occurrence date / time of each data included in the construction data set.
- the date and time of occurrence indicates, for example, the date and time when the event related to the data really occurred.
- the date and time of occurrence is represented by, for example, one or more of a year, a month, a day, and a time.
- the acquisition date / time specifying unit 222 specifies the acquisition date / time of each data included in the construction data set.
- the acquisition date and time indicates, for example, the date and time when the user can actually acquire the data, or the date and time when the user actually acquired the data.
- the acquisition date and time is represented by, for example, one or more of a year, a month, a day, and a time.
- For example, the occurrence date and time of the data on the number of shipments on December 8, 2019 is December 8, 2019.
- On the other hand, suppose that the number of shipments of a certain product is recorded at each factory, aggregated at the end of the month, and actually made available to the data analyst (user) on the 15th of the following month.
- In this case, the acquisition date and time of the shipment quantity data is January 15, 2020.
- the prediction execution timing setting unit 223 sets the timing for predicting the prediction target (hereinafter referred to as the prediction execution timing) based on the nature of the prediction target and the past cases stored in the past case storage unit 121.
- the leak detection unit 224 detects the possibility of leakage of the construction data set based on the nature of the construction data set and the past cases stored in the past case storage unit 121.
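- To make the relation between the occurrence date and time, the acquisition date and time, and the prediction execution timing concrete, the sketch below flags data that would not yet be available when the prediction is executed. The 15th-of-the-following-month rule is taken from the shipment example; the function names and the simple date comparison are assumptions for illustration and are not presented as the detection method of the leak detection unit 224.

```python
from datetime import date


def acquisition_date(occurrence):
    """Shipment example: data becomes available on the 15th of the month after it occurs."""
    if occurrence.month == 12:
        return date(occurrence.year + 1, 1, 15)
    return date(occurrence.year, occurrence.month + 1, 15)


def may_leak(occurrence, prediction_execution):
    """True if the data is not yet acquired at the prediction execution timing (assumed criterion)."""
    return acquisition_date(occurrence) > prediction_execution


shipment_day = date(2019, 12, 8)
print(acquisition_date(shipment_day))
print(may_leak(shipment_day, date(2019, 12, 20)))
```

- Comparing against the occurrence date alone would miss this case, because the shipment occurred before the prediction execution timing but could not be acquired until afterwards.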
- the properties of the construction data set include, for example, the properties of the entire construction data set, the properties of each item of the construction data set, and the properties of each data included in the construction data set.
- the leak correction unit 225 corrects the construction data set so as to eliminate the possibility of a leak detected by the leak detection unit 224, based on, for example, an instruction from the user input via the terminal device 11, past cases stored in the past case storage unit 121, profile information stored in the user profile storage unit 123, and the like.
- In step S101, the acquisition unit 141 acquires past cases and user data from the storage unit 102 as described above.
- In step S102, the time prediction unit 151 predicts the processing time (prediction processing time) required for the analysis process using a part of the acquired user data.
- In step S103, the interpretation unit 152 generates structured data by analyzing and structuring the user data as described above.
- In step S104, the extraction unit 153 extracts the problem setting based on the structured data and the past case as described above.
- In step S105, the extraction unit 153 constructs a data set (construction data set) according to the extracted problem setting, as described above.
- In step S106, the time series processing unit 154 executes the time series processing.
- In step S151, the separation unit 201 determines whether or not the construction data set is a time series data set. For example, if the items of the construction data set include a date and time item related to "date and time", the separation unit 201 determines that the construction data set is a time series data set, and the process proceeds to step S152.
- the date and time item is, for example, an item in which data is represented by at least one of a year, a month, a day, a time, and a day of the week.
- FIG. 13 shows an example of a construction dataset.
- This construction dataset contains items such as "date”, “sales area”, “number of sales”, “event (planned)", and “advertising cost (planned)".
- "date” is the date and time item.
- the category is, for example, a category, a classification, a genre, or the like.
- a set of data including the data of each item, which is the data for one line of the construction data set, is referred to as a record.
- In step S152, the separation unit 201 counts the number of records whose dates and times overlap. For example, the separation unit 201 sorts the construction data set by the date and time item. Then, the separation unit 201 counts the number of records whose date and time item values overlap with those of other records.
- In this example, the number of records whose "date" value overlaps with other records is counted.
- In step S153, the separation unit 201 determines whether or not there are records having overlapping dates and times at a certain ratio or more. Specifically, the separation unit 201 calculates the ratio of the number of records obtained in the process of step S152 to the total number of records in the construction data set. When the calculated ratio is equal to or greater than a predetermined threshold value, the separation unit 201 determines that there are records having overlapping dates and times at a certain ratio or more, and the process proceeds to step S154.
- In step S154, the separation unit 201 counts the number of records whose dates and times do not overlap in each category. For example, the separation unit 201 divides the construction data set into a plurality of series, one for each value of the category item (hereinafter referred to as a category value). Next, the separation unit 201 sorts the data set of each series by the date and time item. Next, the separation unit 201 counts, for each series of data sets, the number of records whose date and time do not overlap with other records. Then, the separation unit 201 totals the number of records counted for the data sets of all series.
- the construction data set of FIG. 13 is divided into a data set of a series in which the value of "sales area" is "region A" and a data set of a series in which the value of "sales area" is "region B".
- the number of records whose "date" value does not overlap with other records in the "region A" series data set is counted.
- Similarly, in the "region B" series data set, the number of records whose "date" value does not overlap with other records is counted.
- the number of records counted in the data set of the "region A" series and the number of records counted in the data set of the "region B" series are totaled.
- In step S155, the separation unit 201 determines whether or not the ratio of records whose dates and times do not overlap in each category is equal to or higher than a certain level. Specifically, the separation unit 201 calculates the ratio of the number of records obtained in the process of step S154 to the total number of records in the construction data set. When the calculated ratio is equal to or greater than a predetermined threshold value, the separation unit 201 determines that the ratio of records whose dates and times do not overlap in each category is equal to or greater than a certain value, and the process proceeds to step S156.
- In step S156, the separation unit 201 determines that the construction data set includes a plurality of time series data sets. That is, the separation unit 201 determines that the construction data set is separated into a plurality of time series data sets for each category value.
- For example, in the construction data set of FIG. 13 as a whole, the ratio of records in which the value of "date" overlaps with other records is high.
- On the other hand, when the construction data set is divided into series by "sales area", the ratio of records in which the value of "date" overlaps with other records within each series data set decreases.
- Therefore, it is determined that the construction data set of FIG. 13 is separated into two time-series data sets, one in which the "sales area" is the "region A" series and the other in which the "sales area" is the "region B" series.
- On the other hand, if it is determined in step S155 that the ratio of records whose dates and times do not overlap in each category is less than a certain value, the process proceeds to step S157.
- Further, if it is determined in step S153 that there are no records with overlapping dates and times at a certain ratio or more, the process proceeds to step S157.
- In step S157, the separation unit 201 determines that the construction data set is a single-series time-series data set.
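- The decision made in steps S152 to S157 can be sketched as follows, assuming the construction data set is a list of dictionaries with one date and time item and one category item. The function names and the default thresholds of 0.5 are assumptions; the text only speaks of predetermined threshold values. The sorting steps are omitted because counting duplicates with `Counter` does not depend on record order.

```python
from collections import Counter, defaultdict


def group_by(records, item):
    """Split records into one list per value of `item`."""
    groups = defaultdict(list)
    for r in records:
        groups[r[item]].append(r)
    return dict(groups)


def overlap_ratio(records, date_item):
    """Steps S152/S153: share of records whose date-time also appears in another record."""
    counts = Counter(r[date_item] for r in records)
    return sum(1 for r in records if counts[r[date_item]] > 1) / len(records)


def non_overlap_ratio(records, date_item, category_item):
    """Steps S154/S155: share of records whose date-time is unique within their own series."""
    unique = 0
    for rows in group_by(records, category_item).values():
        counts = Counter(r[date_item] for r in rows)
        unique += sum(1 for r in rows if counts[r[date_item]] == 1)
    return unique / len(records)


def separate_series(records, date_item, category_item,
                    overlap_threshold=0.5, unique_threshold=0.5):
    """Return {category value: records}, or {None: records} for a single series (step S157)."""
    if (overlap_ratio(records, date_item) >= overlap_threshold
            and non_overlap_ratio(records, date_item, category_item) >= unique_threshold):
        return group_by(records, category_item)  # step S156
    return {None: list(records)}


# Hypothetical data: every date appears once per sales area.
dates = ["2020-01-01", "2020-01-02", "2020-01-03"]
records = [{"date": d, "sales area": a} for d in dates for a in ("region A", "region B")]
print(sorted(separate_series(records, "date", "sales area")))
```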
- In step S158, the resampling unit 202 resamples the date and time of the data set.
- the resampling unit 202 divides the construction data set into each series and sorts the data set of each series by the date and time item.
- When the construction data set is a single-series time series data set, the entire construction data set is treated as one series.
- the resampling unit 202 calculates the time interval between adjacent records for each series. Then, the resampling unit 202 obtains the time interval having the highest frequency of appearance (hereinafter, referred to as the most frequent time interval) in all the series.
- the resampling unit 202 sets the time unit based on the most frequent time interval. For example, if the most frequent time interval < 60 seconds, the time unit is set to "seconds". If 60 seconds ≤ the most frequent time interval < 60 × 60 seconds, the time unit is set to "minutes". If 60 × 60 seconds ≤ the most frequent time interval < 24 × 60 × 60 seconds, the time unit is set to "hour". If 24 × 60 × 60 seconds ≤ the most frequent time interval < 365 × 24 × 60 × 60 seconds, the time unit is set to "day". If 365 × 24 × 60 × 60 seconds ≤ the most frequent time interval, the time unit is set to "year".
- the resampling unit 202 resamples the data set for each series in the set time unit. That is, the value of the date and time item of each series data set is reset based on the set time unit.
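- The selection of the time unit and the resetting of date and time values can be sketched as follows. Treating each lower bound as inclusive, truncating timestamps to the chosen unit, and the names `most_frequent_interval`, `time_unit`, and `truncate` are assumptions made for this illustration; each series is assumed to contain at least two records.

```python
from collections import Counter
from datetime import datetime

MINUTE, HOUR, DAY, YEAR = 60, 60 * 60, 24 * 60 * 60, 365 * 24 * 60 * 60


def most_frequent_interval(series_list):
    """Most common gap (in seconds) between adjacent timestamps over all series."""
    gaps = Counter()
    for stamps in series_list:
        ordered = sorted(stamps)
        gaps.update((b - a).total_seconds() for a, b in zip(ordered, ordered[1:]))
    return gaps.most_common(1)[0][0]


def time_unit(interval_seconds):
    """Map the most frequent interval to a time unit (lower bounds inclusive, an assumption)."""
    if interval_seconds < MINUTE:
        return "second"
    if interval_seconds < HOUR:
        return "minute"
    if interval_seconds < DAY:
        return "hour"
    if interval_seconds < YEAR:
        return "day"
    return "year"


def truncate(ts, unit):
    """Reset a timestamp to the start of its time unit."""
    fields = {
        "year": dict(month=1, day=1, hour=0, minute=0, second=0),
        "day": dict(hour=0, minute=0, second=0),
        "hour": dict(minute=0, second=0),
        "minute": dict(second=0),
        "second": {},
    }
    return ts.replace(microsecond=0, **fields[unit])


# Hypothetical series: mostly daily records, with one two-day gap.
series_a = [datetime(2020, 1, d, 9) for d in (1, 2, 3, 5)]
series_b = [datetime(2020, 1, d, 18) for d in (1, 2)]
unit = time_unit(most_frequent_interval([series_a, series_b]))
print(unit, truncate(series_b[1], unit))
```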
- In step S159, the resampling unit 202 interpolates missing values of the prediction target. Specifically, the resampling unit 202 interpolates, for each series, the value at each date and time at which the value of the prediction target is missing.
- As the interpolation method, for example, a method suited to the nature of the prediction target is selected from among carrying the previous value forward, a moving average, and the like.
- Through the processing of steps S158 and S159, the data of the prediction target in the missing period is interpolated.
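- The two interpolation methods mentioned above can be sketched as follows for one resampled series, where `None` marks a date and time with no value. The function names and the trailing-window form of the moving average are assumptions for illustration.

```python
def fill_previous(values):
    """Carry the last observed value forward; leading gaps stay None."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


def fill_moving_average(values, window=3):
    """Replace each missing value with the mean of up to `window` preceding values."""
    filled = []
    for v in values:
        if v is None:
            recent = [x for x in filled[-window:] if x is not None]
            v = sum(recent) / len(recent) if recent else None
        filled.append(v)
    return filled


# Hypothetical daily sales with missing days.
print(fill_previous([10, None, None, 13]))
print(fill_moving_average([10, 12, None, 14], window=2))
```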
- FIG. 15 is a graph showing, in time series, the transition of the number of sales of each series after resampling the data sets of the series in which the "sales area" is "region A" and the series in which the "sales area" is "region B" in the construction data set of FIG. The horizontal axis shows the date, and the vertical axis shows the number of sales.
- In step S160, the change point detection unit 203 determines whether or not there is a change point of the prediction target.
- the change point detection unit 203 detects a point (date and time) at which the tendency of the value of the prediction target changes significantly when the values of the prediction target are arranged in a time series for each series.
- As the change point detection method, a method suited to the nature of the prediction target is used.
- For example, a method is conceivable in which time-series prediction of the prediction target is repeatedly performed in ascending order of date and time, and a point where the predicted value and the actual value deviate significantly is detected as a change point.
- When a change point is detected in at least one of the series, the change point detection unit 203 determines that a change point of the prediction target exists, and the process proceeds to step S161.
- In step S161, the change point detection unit 203 divides the data set based on the change point. Specifically, the change point detection unit 203 divides the construction data set into a plurality of periods for each change point detected in any of the series.
- This prevents the prediction model from being affected by data from before the change in the prediction target occurred, which in turn prevents the prediction accuracy of the prediction model from deteriorating during actual operation.
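- The idea of detecting a change point as a large gap between a repeated forecast and the actual value, and then dividing the data at that point, can be sketched as follows. The naive forecast (mean of the most recent values), the relative tolerance, and the function names are assumptions standing in for whatever time-series prediction method is actually used.

```python
def detect_change_points(values, window=3, tolerance=0.5):
    """Indices where the actual value deviates from a naive forecast by more than `tolerance` (relative)."""
    change_points = []
    start = 0  # history is restarted after each detected change
    for i in range(len(values)):
        history = values[max(start, i - window):i]
        if len(history) < window:
            continue
        forecast = sum(history) / len(history)
        if abs(values[i] - forecast) > tolerance * max(abs(forecast), 1e-9):
            change_points.append(i)
            start = i
    return change_points


def split_by_change_points(records, change_points):
    """Divide the records into consecutive periods, one starting at each change point."""
    bounds = [0] + sorted(change_points) + [len(records)]
    return [records[a:b] for a, b in zip(bounds, bounds[1:]) if a < b]


# Hypothetical number of sales with a level shift at index 4.
sales = [10, 11, 10, 11, 30, 31, 29, 30]
points = detect_change_points(sales)
print(points, split_by_change_points(sales, points))
```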
- steps S154 to S161 may be performed for each category item.
- the construction data set is separated and resampled for each category item.
- On the other hand, in step S160, if the change point detection unit 203 does not detect a change point in any of the series, it determines that no change point of the prediction target exists, the process of step S161 is skipped, and the process proceeds to step S162.
- In step S162, the display control unit 143 presents the result of the time series processing to the user.
- the display control unit 143 causes the terminal device 11 to display the image of FIG.
- a graph 301, a list box 302, a list box 303, a check box 304, a list box 305, a "back" button 306, a "cancel" button 307, and an "execute" button 308 are displayed.
- Graph 301 shows a graph of the number of sales in the series of "Region A” and the series of "Region B" in the construction data set of FIG. That is, the construction data set is separated into two time series data sets of "region A” and "region B", and the result of resampling the date and time of each time series data set is shown.
- the list box 302 and the list box 303 are used for setting the forecast period.
- the forecast period is a period for which a forecast target (for example, the number of sales) is forecast.
- the list box 302 is used to set the start date and time of the prediction period
- the list box 303 is used to set the end date and time of the prediction period. That is, the prediction period is set from the date and time set by the list box 302 to the date and time set by the list box 303. In this example, a period from 10 days ahead to 40 days ahead is set as the prediction period based on the prediction execution timing.
- the check box 304 is used when the prediction period is directly input without using the list box 302 and the list box 303. That is, when the check box 304 is checked, the prediction period can be directly input. On the other hand, when the check box 304 is unchecked, the prediction period can be input using the list box 302 and the list box 303.
- the list box 305 is used to select a series that separates the construction data set into a plurality of time series data sets. For example, if the construction data set can be separated into a plurality of time series data sets by a plurality of category items, the name of each category item is displayed as a series name in the list box 305. Then, the graph of the time series data set separated based on the category items selected by the list box 305 is displayed as the graph 301.
- the "back” button 306 is used to return to the image before the transition to the image of FIG.
- the "Cancel” button 307 is used to cancel the training and evaluation execution of the prediction model.
- the "execute" button 308 is used to execute the learning and evaluation of the prediction model.
- On the other hand, in step S151, when no date and time item is included in the items of the construction data set, the separation unit 201 determines that the construction data set is a non-time series data set, the processing of steps S151 to S162 is skipped, and the time series processing ends.
- In step S107, the leak detection correction unit 155 executes the leak detection process.
- In step S201, the occurrence date / time specifying unit 221 and the acquisition date / time specifying unit 222 specify the occurrence date and time and the acquisition date and time of the data.
- the occurrence date / time specifying unit 221 specifies the occurrence date / time of each data included in the construction data set.
- When the date and time at which the data was generated is specified by the user or is self-evident, the occurrence date / time specifying unit 221 uses that date and time as-is as the occurrence date and time of the data. For example, the occurrence date and time of the "number of shipments" data for "December 8, 2019" is specified as December 8, 2019.
- Otherwise, the occurrence date and time specifying unit 221 estimates the occurrence date and time based on, for example, the content of the data.
- the occurrence date / time specifying unit 221 estimates the occurrence date / time of the data based on the phenomenon related to the data and a database of the times at which various phenomena occurred. For example, when a building is shown in the image data, if the date of construction of the building is known, it is estimated that the date and time of occurrence of the image data is after the date of construction. For example, when the items of the construction data set include "song title", it is estimated that the occurrence date and time of the data included in each record is after the release date of the song whose title is included in the same record. For example, the occurrence date and time of failure data of a manufacturing device can be estimated from the production date and service life of the manufacturing device.
- the acquisition date / time specifying unit 222 specifies the acquisition date / time of each data included in the data set.
- Data acquisition date and time is strongly influenced by workflow in business, for example, so it is difficult to estimate using a database like the date and time of occurrence. On the other hand, since the data used in the business is under the control of the business workflow, the acquisition date and time are often recorded.
- the acquisition date / time specifying unit 222 specifies the acquisition date / time of each data based on the time stamp of each data file, Exif (Exchangeable image file format) information, database update time, version control system history, and the like.
- In step S202, the prediction execution timing setting unit 223 sets the timing for executing the prediction.
- the timing to execute the prediction of the prediction target (prediction execution timing) is at the earliest after the timing when the last data of the data set used for the prediction is acquired. However, in reality, a certain period called a gap is required between the acquisition of the last data and the execution of the prediction.
- the prediction execution timing setting unit 223 sets how long before the prediction period the prediction of the prediction target is executed, based on the past case and the nature of the prediction target.
- the prediction execution timing setting unit 223 sets the prediction execution timing 3 days before the prediction period.
- the prediction execution timing is set based on the prediction period. For example, in the example of FIG. 16, since the prediction period is set to a period from 10 days ahead to 40 days ahead, the prediction execution timing is set 40 days before the prediction period.
- In step S203, the leak detection unit 224 determines whether or not there is data whose occurrence date and time or acquisition date and time is later than the prediction execution timing. For example, the leak detection unit 224 compares the occurrence date and time and the acquisition date and time of each data in the construction data set with the prediction execution timing. Then, when there is data for which at least one of the occurrence date and time and the acquisition date and time is later than the prediction execution timing, the leak detection unit 224 determines that such data exists, and the process proceeds to step S204.
- In step S204, the leak detection unit 224 determines that there is a possibility of a time-series leak. That is, because data whose occurrence date / time or acquisition date / time is later than the prediction execution timing has not yet occurred or cannot be acquired by the prediction execution timing, the leak detection unit 224 determines that there is a possibility of a leak. In this way, the possibility of a time-series leak is detected based on the relationship between the occurrence date / time or acquisition date / time of the data and the prediction execution timing.
- On the other hand, in step S203, when there is no data for which at least one of the occurrence date and time and the acquisition date and time is later than the prediction execution timing, the leak detection unit 224 determines that no such data exists, the process of step S204 is skipped, and the process proceeds to step S205. That is, it is determined that the possibility of a time-series leak is low.
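The comparison in steps S203 and S204 can be illustrated with a short sketch. The `DataPoint` structure, the function name, and the way the prediction execution timing is derived (start of the prediction period minus a gap) are assumptions made for illustration; the source only states that the occurrence and acquisition dates are compared with the prediction execution timing.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class DataPoint:
    item: str                        # item (column) name
    occurred_at: Optional[datetime]  # occurrence date/time, if it could be specified
    acquired_at: Optional[datetime]  # acquisition date/time, if it could be specified


def find_time_series_leaks(data: List[DataPoint], prediction_start: datetime,
                           gap: timedelta) -> List[str]:
    """Return the items whose data occurs or is acquired after the prediction execution timing."""
    execution_time = prediction_start - gap  # hypothetical definition of the timing
    leaked: List[str] = []
    for d in data:
        stamps = [t for t in (d.occurred_at, d.acquired_at) if t is not None]
        if any(t > execution_time for t in stamps) and d.item not in leaked:
            leaked.append(d.item)
    return leaked
```

With a prediction period starting on January 31, 2020 and a 30-day gap, a "sold out" flag that only occurs on January 31 would be reported, while shipment data acquired in December 2019 would not.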
- In step S205, the leak detection unit 224 determines the distinctiveness of each item in the data set. That is, the leak detection unit 224 determines whether or not each item of the data set includes identification information (for example, an ID or the like) that can identify each record in the data set.
- The following are examples of the properties of items with distinctiveness (hereinafter referred to as identification items).
  - The unique rate of the data in the item is 100% or close to 100%.
  - The item name includes the character string "ID".
  - The item is placed at the beginning of the data set or in a column near the beginning.
  - The item is not a date and time item.
  - The data is represented by a character string or an integer.
  - No unit can be attached to the data.
- the leak detection unit 224 determines the distinctiveness of each item of the data set using a rule-based or machine-learning-based approach that relies on the above-mentioned properties and the like.
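A rule-based version of this determination might be sketched as below. The function name, the 95% unique-rate cutoff, the "first two columns" rule, and the way the rules are combined are all assumptions chosen for illustration; the property that no unit can be attached to the data is not modeled here.

```python
from typing import Sequence


def looks_like_identifier(name: str, values: Sequence, column_index: int,
                          is_datetime: bool = False,
                          min_unique_rate: float = 0.95) -> bool:
    """Rule-based guess at whether an item is an identification item."""
    if is_datetime or not values:
        return False  # date and time items are never treated as identifiers
    if not all(isinstance(v, (str, int)) and not isinstance(v, bool) for v in values):
        return False  # identifiers are represented by strings or integers
    unique_rate = len(set(values)) / len(values)
    lowered = name.lower()
    # Match "ID" as a name component so that names such as "humidity" are not caught.
    name_hint = lowered == "id" or lowered.endswith(("_id", " id")) or name.endswith("ID")
    near_start = column_index <= 1  # at or near the beginning of the data set
    return unique_rate >= min_unique_rate and (name_hint or near_start)
```

Requiring a high unique rate plus at least one supporting hint keeps ordinary numeric columns that happen to have distinct values from being flagged.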
- In step S206, the leak detection unit 224 determines whether or not the identification item exists based on the result of the process in step S205. If it is determined that the identification item exists, the process proceeds to step S207.
- In step S207, the leak detection unit 224 clusters the identification items. For example, the leak detection unit 224 performs clustering based on the Levenshtein distance when the data (identification information) of the identification item is represented by a character string.
- the records in the construction data set are classified into multiple clusters based on the identification item (identification information).
- In step S208, the leak detection unit 224 examines the correlation between each cluster and the prediction target. For example, the leak detection unit 224 calculates the correlation coefficient between each cluster and the value of the prediction target in the records included in that cluster.
- In step S209, the leak detection unit 224 determines whether or not there is a correlation between the identification item and the prediction target.
- When the correlation coefficient calculated in the process of step S208 is equal to or greater than a predetermined threshold value, the leak detection unit 224 determines that there is a correlation between the identification item and the prediction target, and the process proceeds to step S210.
- In step S210, the leak detection unit 224 determines that there is a possibility of an identification information leak.
- the value of the identification information included in the identification item has no meaning other than uniquely identifying the record.
- For example, when the identification information includes characters indicating the user's gender and the prediction target is the user's clothing purchase amount, it is assumed that there is some correlation between the gender and the clothing purchase amount. Therefore, in this case, it may be determined that there is a possibility of an identification information leak.
- On the other hand, in step S209, when the correlation coefficient calculated in the process of step S208 is less than the predetermined threshold value, the leak detection unit 224 determines that there is no correlation between the identification item and the prediction target, the process of step S210 is skipped, and the process proceeds to step S211. That is, it is determined that the possibility of an identification information leak is low.
- On the other hand, if it is determined in step S206 that the identification item does not exist, the processing of steps S207 to S210 is skipped, and the processing proceeds to step S211.
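Steps S207 to S210 can be sketched in a few functions. The Levenshtein distance is the standard dynamic-programming edit distance; everything else is an assumption made for illustration: the greedy clustering rule, the use of a Pearson correlation between a 0/1 cluster-membership indicator and the prediction target, and the threshold of 0.7.

```python
from math import sqrt
from typing import Dict, List, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (dynamic programming over two rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def cluster_ids(ids: Sequence[str], max_distance: int = 1) -> List[int]:
    """Greedy clustering: join the first cluster whose representative is close enough."""
    representatives: List[str] = []
    labels: List[int] = []
    for s in ids:
        for k, rep in enumerate(representatives):
            if levenshtein(s, rep) <= max_distance:
                labels.append(k)
                break
        else:  # no existing cluster matched, so open a new one
            representatives.append(s)
            labels.append(len(representatives) - 1)
    return labels


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient; 0.0 when either input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def cluster_target_correlations(labels: Sequence[int],
                                target: Sequence[float]) -> Dict[int, float]:
    """Correlation between membership in each cluster (0/1) and the prediction target."""
    return {k: pearson([1.0 if l == k else 0.0 for l in labels], target)
            for k in sorted(set(labels))}


def may_have_identification_leak(ids: Sequence[str], target: Sequence[float],
                                 threshold: float = 0.7) -> bool:
    """Flag a possible leak when any cluster correlates strongly with the target."""
    correlations = cluster_target_correlations(cluster_ids(ids), target)
    return any(abs(c) >= threshold for c in correlations.values())
```

In the gender example above, identifiers such as `male01` and `female01` fall into separate clusters, and a large gap in purchase amounts between those clusters produces a high correlation and therefore a warning.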
- In step S211, the leak detection unit 224 determines whether or not the data set includes a category item.
- If it is determined that the data set includes a category item, the process proceeds to step S212.
- In step S212, the leak detection unit 224 performs group K-fold cross-validation using groups divided by category.
- the leak detection unit 224 divides the construction data set into a plurality of groups for each category item value (category value). Then, the leak detection unit 224 trains and verifies the prediction model by using each group as test data and the remaining groups as training data. As a result, the prediction accuracy of the prediction model when each group is used for the test data can be obtained.
- In step S213, the leak detection unit 224 determines whether or not there is a large difference in prediction accuracy between the groups. Specifically, the leak detection unit 224 compares the prediction accuracy of the prediction model when each group is used for the test data. For example, when there is a significant difference in the prediction accuracy of each group, the leak detection unit 224 determines that the difference in the prediction accuracy between the groups is large, and the process proceeds to step S214. Alternatively, for example, when the difference between the maximum value and the minimum value of the prediction accuracy of each group is equal to or greater than a predetermined threshold value, the leak detection unit 224 determines that the difference in prediction accuracy between the groups is large, and the process proceeds to step S214.
- In step S214, the leak detection unit 224 determines that there is a possibility of a category leak. That is, while the data of the category item may contain information useful for predictive analysis, it is also possible that the set problem can be solved with high accuracy by a simple rule base using the data of the category item. Since such a problem is practically meaningless, it is determined that there is a possibility of a leak in the category item.
- On the other hand, in step S213, for example, when there is no significant difference in the prediction accuracy between the groups, the leak detection unit 224 determines that the difference in the prediction accuracy between the groups is small, the process of step S214 is skipped, and the process proceeds to step S215.
- Alternatively, for example, when the difference between the maximum value and the minimum value of the prediction accuracy of each group is less than the predetermined threshold value, the leak detection unit 224 determines that the difference in prediction accuracy between the groups is small, the process of step S214 is skipped, and the process proceeds to step S215. That is, it is determined that the possibility of a category leak is low.
- On the other hand, if it is determined in step S211 that the data set does not include a category item, the processing of steps S212 to S214 is skipped, and the processing proceeds to step S215.
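The group K-fold check of steps S212 to S214 can be sketched as follows. To keep the example self-contained, a stand-in "model" that predicts the training mean is used and the score is a mean absolute error (lower is better) rather than an accuracy; both are assumptions, as are the function names. The max-minus-min rule corresponds to the second criterion described above.

```python
from collections import defaultdict
from statistics import mean
from typing import Callable, Dict, Hashable, List, Sequence, Tuple

Record = Tuple[Hashable, float]  # (category value, value of the prediction target)


def mean_model_error(train: Sequence[float], test: Sequence[float]) -> float:
    """Stand-in model: predict the training mean; return mean absolute error on the test data."""
    prediction = mean(train)
    return mean(abs(y - prediction) for y in test)


def group_kfold_errors(
    records: Sequence[Record],
    evaluate: Callable[[Sequence[float], Sequence[float]], float] = mean_model_error,
) -> Dict[Hashable, float]:
    """Hold out each category value in turn and train on the remaining groups."""
    groups: Dict[Hashable, List[float]] = defaultdict(list)
    for category, y in records:
        groups[category].append(y)
    errors = {}
    for held_out, test in groups.items():
        train = [y for c, ys in groups.items() if c != held_out for y in ys]
        errors[held_out] = evaluate(train, test)
    return errors


def may_have_category_leak(errors: Dict[Hashable, float], threshold: float) -> bool:
    """Flag a possible leak when max - min of the per-group scores reaches the threshold."""
    return max(errors.values()) - min(errors.values()) >= threshold
```

A real implementation would pass its own training and scoring routine as `evaluate`; the data set must contain at least two category values for the held-out split to be meaningful.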
- In step S215, it is determined whether or not the construction data set is a time series data set, as in the process of step S151 of FIG. If it is determined that the construction data set is a non-time series data set, the process proceeds to step S216.
- In step S216, the leak detection unit 224 performs holdout verification and K-fold cross-validation.
- the leak detection unit 224 performs holdout verification while maintaining the order (of records) of the construction data set. For example, as shown in FIG. 19, the leak detection unit 224 sets a predetermined ratio (for example, 80%) of records from the beginning of the construction data set as training data (train), sets the remaining records as test data (holdout), and trains and verifies the prediction model. As a result, the prediction accuracy of the prediction model is obtained.
- the leak detection unit 224 performs K-fold cross-validation while maintaining the order (of records) of the construction data set. For example, the leak detection unit 224 divides the construction data set into three groups (fold1 to fold3), as shown in FIG. Then, the leak detection unit 224 trains and verifies the prediction model by using each group as test data and the remaining groups as training data. As a result, the prediction accuracy of the prediction model when each group is used for the test data can be obtained.
- In step S217, the leak detection unit 224 determines whether or not there is a large difference between the result of the holdout verification and the result of the K-fold cross-validation. Specifically, the leak detection unit 224 compares the prediction accuracy in the holdout verification with the plurality of prediction accuracies obtained when each group is used as the test data in the K-fold cross-validation.
- For example, when there is a significant difference between any of the prediction accuracies in K-fold cross-validation and the prediction accuracy in holdout verification, the leak detection unit 224 determines that the difference between the result of holdout verification and the result of K-fold cross-validation is large, and the process proceeds to step S218.
- Alternatively, for example, the leak detection unit 224 calculates the difference between each prediction accuracy in K-fold cross-validation and the prediction accuracy in holdout verification. Then, when the maximum value of the calculated differences in prediction accuracy is equal to or greater than a predetermined threshold value, the leak detection unit 224 determines that the difference between the result of holdout verification and the result of K-fold cross-validation is large, and the process proceeds to step S218.
- In step S218, the leak detection unit 224 determines that there is a possibility of a sort order leak.
- When the construction data set is a non-time series data set, predictive analysis is performed on the assumption that the tendency of the data does not change between the time of pre-evaluation and the time of actual operation.
- If the accuracy of the prediction model fluctuates greatly depending on how the training data is selected, that premise does not hold.
- On the other hand, in step S217, when there is no significant difference between each prediction accuracy in K-fold cross-validation and the prediction accuracy in holdout verification, the leak detection unit 224 determines that the difference between the result of holdout verification and the result of K-fold cross-validation is small, the process of step S218 is skipped, and the process proceeds to step S219. That is, it is determined that the possibility of a sort order leak is low.
- Alternatively, for example, when the maximum value of the calculated differences in prediction accuracy is less than the predetermined threshold value, the leak detection unit 224 determines that the difference between the result of holdout verification and the result of K-fold cross-validation is small, the process of step S218 is skipped, and the process proceeds to step S219. That is, it is determined that the possibility of a sort order leak is low.
- On the other hand, if it is determined in step S215 that the construction data set is a time series data set, the processing of steps S216 to S218 is skipped, and the processing proceeds to step S219.
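The comparison of order-preserving holdout verification with order-preserving K-fold cross-validation (steps S216 to S218) can be sketched as below. As a simplification, the "model" predicts the training mean and the score is a mean absolute error; these, the fold boundaries, and the function names are assumptions for illustration only.

```python
from statistics import mean
from typing import List, Sequence


def mae_of_mean_model(train: Sequence[float], test: Sequence[float]) -> float:
    """Stand-in model: predict the training mean; return mean absolute error on the test data."""
    prediction = mean(train)
    return mean(abs(y - prediction) for y in test)


def holdout_error(targets: Sequence[float], train_ratio: float = 0.8) -> float:
    """Train on the leading records and test on the rest, keeping the record order."""
    cut = int(len(targets) * train_ratio)
    return mae_of_mean_model(targets[:cut], targets[cut:])


def ordered_kfold_errors(targets: Sequence[float], k: int = 3) -> List[float]:
    """Split into k contiguous folds (no shuffling) and test on each fold in turn."""
    n = len(targets)
    bounds = [round(i * n / k) for i in range(k + 1)]
    errors = []
    for a, b in zip(bounds, bounds[1:]):
        train = list(targets[:a]) + list(targets[b:])
        errors.append(mae_of_mean_model(train, targets[a:b]))
    return errors


def may_have_sort_order_leak(targets: Sequence[float], threshold: float,
                             train_ratio: float = 0.8, k: int = 3) -> bool:
    """Flag a possible leak when some fold's score differs from the holdout score by the threshold or more."""
    h = holdout_error(targets, train_ratio)
    return max(abs(e - h) for e in ordered_kfold_errors(targets, k)) >= threshold
```

If the records happen to be sorted by the prediction target, the middle fold scores very differently from the holdout split, which is the kind of dependence on how the training data is selected that the text describes.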
- In step S219, the leak detection unit 224 detects the possibility of an item name leak based on the item names of the data set.
- For example, the degree of difficulty in predicting or controlling each phenomenon is compiled into a database in advance and stored in the storage unit 102.
- the leak detection unit 224 matches the item name of each item of the construction data set with each phenomenon based on a rule base or the similarity between distributed representations of words. Next, the leak detection unit 224 determines the degree of difficulty in predicting or controlling the phenomenon corresponding to each item of the construction data set based on the above-mentioned database. Then, the leak detection unit 224 determines that there is a possibility of a leak for an item corresponding to a phenomenon whose difficulty level is equal to or higher than a predetermined threshold value.
- For example, when the data set contains item names related to natural phenomena such as "temperature", "weather", and "seismic intensity", and the prediction execution timing is one month or more ahead, it is determined that there is a possibility of a leak for those items.
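A very small rule-based sketch of the item name check follows. The difficulty table, its scores, the threshold, and the one-month cutoff expressed as 30 days are all hypothetical values; exact name matching stands in for the rule base or word-similarity matching mentioned above.

```python
from datetime import timedelta
from typing import Dict, List, Sequence

# Hypothetical difficulty database: phenomenon -> difficulty of predicting or controlling it (0 to 1).
DIFFICULTY: Dict[str, float] = {
    "temperature": 0.9,
    "weather": 0.9,
    "seismic intensity": 1.0,
    "day of the week": 0.0,
}


def items_with_name_leak(item_names: Sequence[str], lead_time: timedelta,
                         difficulty: Dict[str, float] = DIFFICULTY,
                         threshold: float = 0.8,
                         min_lead: timedelta = timedelta(days=30)) -> List[str]:
    """Return the items whose matched phenomenon is too hard to know `lead_time` in advance."""
    if lead_time < min_lead:
        return []  # short-range forecasts of such phenomena may still be usable
    flagged = []
    for name in item_names:
        key = name.strip().lower()  # exact match stands in for similarity matching
        if difficulty.get(key, 0.0) >= threshold:
            flagged.append(name)
    return flagged
```

Items that do not match any known phenomenon are treated as having zero difficulty here, which is a conservative choice for a sketch; a real system would need a policy for unknown names.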
- the leak detection unit 224 detects the possibility of a domain leak based on the domain of the data set. For example, the leak detection unit 224 detects the possibility of a leak in the construction data set based on the domain of the entire construction data set (for example, sales, production control, personnel, etc.), referring to past cases of leaks that tend to occur in each domain.
- In step S108, the learning unit 156 learns the prediction model based on the problem setting and the construction data set as described above.
- In step S109, the evaluation unit 157 divides the data set into training data and test data as described above, and generates a prediction model using the training data.
- the evaluation unit 157 evaluates the prediction model using the test data.
- In step S110, the prediction unit 158 predicts the business effect when the prediction model is introduced into the business, as described above.
- In step S111, the collection decision unit 159 determines, as proposal items and based on past cases, items that may increase the business effect when added to the data set.
- In step S112, the contribution calculation unit 160 calculates the contribution of the feature amount of the test data as described above.
- In step S113, the display control unit 143 presents the processing result to the user.
- the display control unit 143 causes the terminal device 11 to display an image showing the analysis processing result of FIG. 20.
- the contribution of each item such as "number of shipments last week", "day of the week", and "date" is displayed in a bar graph. Further, as described above, the contribution includes a positive value and a negative value. Therefore, the value obtained by combining the total value of the positive values and the total value of the negative values is displayed as a bar graph.
- icons 331 to 333 indicating that there is a possibility of leakage are displayed for the items of "number of shipments last week", "temperature", and "humidity" that are determined to have a possibility of leakage.
- the window 334 is displayed for the item of "Number of shipments last week” that was determined to have a possibility of leakage.
- In the window 334, it is displayed that the "number of shipments last week" has a high possibility of leaking and a very high contribution, together with the reason why the possibility of leaking was determined to be high.
- the reason is that the "number of shipments last week" cannot be obtained one month ago (predicted execution timing).
- the window 335 is displayed for the items of "temperature” and “humidity” that are determined to have a possibility of leakage.
- In the window 335, it is displayed that "temperature" and "humidity" have a high possibility of leakage, together with the reason why the possibility of leakage was determined to be high.
- the reason is that it is difficult to obtain "temperature” and "humidity” one month before (predicted execution timing).
- a window 336 is displayed at the bottom of the image.
- In the window 336, the prediction accuracy before and after removing the items that may leak is shown, together with the advantage that the same prediction accuracy can be maintained in actual operation by removing those items.
- a menu is displayed for the user to select whether or not to apply automatic leak correction.
- In step S114, the information processing unit 142 determines whether or not to correct the data set or the problem setting.
- the leak correction unit 225 determines that the construction data set is to be corrected when the user selects, in the window 336 of the image of FIG. 20, to apply automatic leak correction, and the process proceeds to step S115.
- Even if the user does not select automatic leak correction in the window 336 of the image of FIG. 20, the leak correction unit 225 determines that the construction data set is to be corrected when the construction data set includes an item whose possibility of leakage is equal to or higher than a predetermined threshold value, and the process proceeds to step S115.
- the learning unit 156 determines that the construction data set or the problem setting is to be corrected when a command for correcting the construction data set or the problem setting is input via the terminal device 11, and the process proceeds to step S115.
- In step S115, the information processing unit 142 corrects the data set or problem setting.
- the leak correction unit 225 removes all the items that may leak from the construction data set when the user selects to apply the automatic leak correction.
- the leak correction unit 225 removes items having a possibility of leakage of a predetermined threshold value or more from the construction data set when the user does not select the application of automatic leak correction.
- the learning unit 156 corrects the construction data set or the problem setting according to the command.
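The two removal rules of the leak correction unit 225 can be sketched as follows, assuming each record is a dictionary and each item carries a hypothetical leak score between 0 and 1 (the score scale, the threshold value, and the function name are not from the source):

```python
from typing import Dict, List


def correct_dataset(records: List[dict], leak_scores: Dict[str, float],
                    auto_correct: bool, threshold: float = 0.8) -> List[dict]:
    """Drop leaking items from every record.

    With automatic correction selected, every item with any possibility of leakage
    is removed; otherwise only items at or above the threshold are removed.
    """
    if auto_correct:
        to_remove = {item for item, score in leak_scores.items() if score > 0.0}
    else:
        to_remove = {item for item, score in leak_scores.items() if score >= threshold}
    return [{k: v for k, v in r.items() if k not in to_remove} for r in records]
```

After the correction, the prediction model would be trained and evaluated again on the returned records.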
- After that, the process returns to step S106, and the processes of steps S106 to S115 are repeatedly executed until it is determined in step S114 that the data set or the problem setting is not to be corrected.
- the prediction model is trained and evaluated based on the corrected construction data set or problem setting.
- On the other hand, if it is determined in step S114 that the data set and the problem setting are not to be corrected, the information processing ends.
- the time prediction unit 151 may predict the processing time at the timing when the processing of each step is completed.
- when the extraction unit 153 extracts a plurality of problem settings, the information processing device 12 may repeatedly execute the processes of steps S105 to S115 for each problem setting so as to perform analysis processing for all the problem settings.
- In a predictive analysis system that automates the process from problem setting to predictive analysis, such as the information processing system 1, data collection and model tuning are performed in the direction of improving accuracy, so leaks are likely to occur.
- items and data that may leak in the constructed data set can be automatically detected with high accuracy.
- corrections for detected leaks can be made automatically or manually.
- each of the above configurations is an example, and the information processing system 1 may have any system configuration as long as it is possible to extract problem settings and construct a data set based on past cases and user data.
- the information processing device 12 and the terminal device 11 may be integrated.
- each component of each device shown in the figure is a functional concept, and does not necessarily have to be physically configured as shown in the figure. That is, the specific form of distribution / integration of each device is not limited to the one shown in the figure, and all or part of each device can be functionally or physically distributed or integrated in arbitrary units according to various loads, usage conditions, and the like.
- feedback from each user regarding the detection result of the possibility of leakage may be accumulated in the user profile storage unit 123, and the correction method at the time of automatic leak correction may be personalized for each user.
- the items to be removed from the construction data set at the time of automatic leak correction may be personalized for each user.
- the possibility of the leak may be determined by integrating the possibility of the plurality of types of leaks described above.
- the data set may be a time-series data set in which records are arranged at predetermined time intervals, and time-series predictive analysis processing may be performed.
- the construction data set may be divided into a plurality of groups based on the values of items other than the category items, and group K-fold cross-validation may be performed. Further, for example, the construction data set may be divided into a plurality of groups based on the combination of the values of two or more items, and group K-fold cross-validation may be performed.
- In step S216 of FIG. 18, only K-fold cross-validation may be performed, and if the difference in prediction accuracy between groups in K-fold cross-validation is large, it may be determined that there is a possibility of a sort order leak.
- the possibility of a sort order leak may be detected based on the difference.
- the prediction model may be trained using different construction data sets, and the actual prediction analysis may be performed using different prediction models for each prediction period. For example, when performing predictive analysis after 1 to 12 months, the prediction model is dynamically switched after 1 month, 2 to 4 months, 5 to 8 months, and 9 to 12 months. Predictive analytics may be performed.
- In the image displaying the detection result of the possibility of leakage, in addition to the items determined to have a possibility of leakage, for example, the data determined to have a possibility of leakage and the reason for that determination may be displayed.
- FIG. 21 is a hardware configuration diagram showing an example of a computer 1000 that realizes the functions of information processing devices such as the information processing device 12 and the terminal device 11.
- the computer 1000 includes a CPU 1100, a RAM 1200, a ROM (Read Only Memory) 1300, an HDD (Hard Disk Drive) 1400, a communication interface 1500, and an input / output interface 1600.
- Each part of the computer 1000 is connected by a bus 1050.
- the CPU 1100 operates based on the program stored in the ROM 1300 or the HDD 1400, and controls each part. For example, the CPU 1100 expands the program stored in the ROM 1300 or the HDD 1400 into the RAM 1200 and executes processing corresponding to various programs.
- the ROM 1300 stores a boot program such as a BIOS (Basic Input Output System) executed by the CPU 1100 when the computer 1000 is started, a program that depends on the hardware of the computer 1000, and the like.
- the HDD 1400 is a computer-readable recording medium that non-transitorily records a program executed by the CPU 1100 and data used by the program.
- the HDD 1400 is a recording medium for recording a program according to the present disclosure, which is an example of program data 1450.
- the communication interface 1500 is an interface for the computer 1000 to connect to an external network 1550 (for example, the Internet).
- the CPU 1100 receives data from another device or transmits data generated by the CPU 1100 to another device via the communication interface 1500.
- the input / output interface 1600 is an interface for connecting the input / output device 1650 and the computer 1000.
- the CPU 1100 receives data from an input device such as a keyboard or mouse via the input / output interface 1600. Further, the CPU 1100 transmits data to an output device such as a display, a speaker, or a printer via the input / output interface 1600. Further, the input / output interface 1600 may function as a media interface for reading a program or the like recorded on a predetermined recording medium (media).
- the media is, for example, an optical recording medium such as DVD (Digital Versatile Disc) or PD (Phase change rewritable Disk), a magneto-optical recording medium such as MO (Magneto-Optical disk), a tape medium, a magnetic recording medium, or a semiconductor memory.
- the CPU 1100 of the computer 1000 realizes the functions of the control unit 103 and the like by executing the information processing program loaded on the RAM 1200.
- the HDD 1400 stores the program related to the present disclosure and the data in the storage unit 102.
- the CPU 1100 reads the program data 1450 from the HDD 1400 and executes the program, but as another example, these programs may be acquired from another device via the external network 1550.
- the program executed by the computer may be a program whose processing is performed in chronological order following the order described in this specification, or a program whose processing is performed in parallel or at a necessary timing, such as when a call is made.
- the system means a set of a plurality of components (devices, modules (parts), etc.), and it does not matter whether all the components are in the same housing. Therefore, a plurality of devices housed in separate housings and connected via a network, and a single device in which a plurality of modules are housed in one housing, are both systems.
- the embodiment of the present technology is not limited to the above-described embodiment, and various changes can be made without departing from the gist of the present technology.
- this technology can have a cloud computing configuration in which one function is shared by a plurality of devices via a network and processed jointly.
- each step described in the above flowchart can be executed by one device or shared by a plurality of devices.
- when one step includes a plurality of processes, the plurality of processes included in that step can be executed by one device or shared by a plurality of devices.
- (1) An information processing device including a leak detection unit that detects the possibility of a leak in a data set, based on properties of the data set used for training a prediction model used for predictive analysis.
- (2) The information processing device according to (1) above, wherein the leak detection unit detects the possibility of a leak in the data set based on the relationship between the occurrence date and time or the acquisition date and time of the data included in the data set and the timing at which the predictive analysis is performed.
- (4) The information processing device according to any one of (1) to (3) above, wherein the leak detection unit detects the possibility of a leak in the data set based on the correlation between an identification item of the data set, which is an item capable of identifying data, and the prediction target of the predictive analysis.
- (5) The information processing device according to (4) above, wherein the leak detection unit determines that the identification item may be a leak when there is a correlation between the identification item and the prediction target.
- (6) The information processing device according to any one of (1) to (5) above, wherein the leak detection unit divides the data set into a plurality of groups according to the value of at least one item of the data set, and detects the possibility of a leak in the data of that item based on the result of performing group K-fold cross-validation.
- (7) The information processing device according to (6) above, wherein the leak detection unit detects the possibility of a leak in the data of the item based on the difference in prediction accuracy between the groups.
- (8) The information processing device according to any one of (1) to (7) above, wherein the leak detection unit divides the data set into training data and test data in a plurality of different patterns while preserving the order of the data set, and detects the possibility of a leak in the data set based on the result of verifying the prediction model for each pattern.
- (9) The information processing device according to (8) above, wherein the leak detection unit detects the possibility of a leak in the order of the data set based on the difference in prediction accuracy between the patterns.
- (14) The information processing device according to any one of (1) to (13) above, further including a leak correction unit that corrects the data set based on the detection result of the possibility of a leak in the data set.
- (15) The information processing device according to (14) above, wherein the leak correction unit removes items that may be a leak from the data set.
- (16) The information processing device according to (14) or (15) above, wherein the leak correction unit personalizes the correction method for the data set based on feedback from the user on the detection result of the possibility of a leak in the data set.
- (17) The information processing device according to any one of (1) to (16) above, further including a separation unit that separates the data set into time-series data sets of a plurality of series based on the degree to which values of an item related to date and time in the data set are duplicated.
- (18) The information processing device according to any one of (1) to (17) above, further including a resampling unit that resamples the data set based on the interval between values of an item related to date and time in the data set.
- (19) An information processing method in which an information processing device detects the possibility of a leak in a data set based on properties of the data set used for training a prediction model used for predictive analysis.
Landscapes
- Engineering & Computer Science (AREA)
- Business, Economics & Management (AREA)
- Human Resources & Organizations (AREA)
- Economics (AREA)
- Strategic Management (AREA)
- Marketing (AREA)
- Game Theory and Decision Science (AREA)
- Entrepreneurship & Innovation (AREA)
- Development Economics (AREA)
- Operations Research (AREA)
- Quality & Reliability (AREA)
- Tourism & Hospitality (AREA)
- Physics & Mathematics (AREA)
- General Business, Economics & Management (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
Description
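1. Embodiment
1-1. Background
1-2. Configuration example of the information processing system according to the embodiment
1-3. Overview of information processing according to the embodiment
1-4. Configuration example of the information processing device according to the embodiment
1-5. Configuration example of the time-series processing unit according to the embodiment
1-6. Configuration example of the leak detection and correction unit according to the embodiment
1-7. Information processing procedure according to the embodiment
2. Modifications
3. Hardware configuration
[1-1. Background]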
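First, before the embodiments of the present disclosure are described in detail, a workflow for using predictive analysis in business is described as background to the embodiments.
FIG. 2 is a diagram showing a configuration example of the information processing system 1 according to the embodiment of the present disclosure. The information processing system 1 includes a terminal device 11 and an information processing device 12. The terminal device 11 and the information processing device 12 are communicably connected, by wire or wirelessly, via a predetermined communication network (network N). The information processing system 1 may include a plurality of terminal devices 11 and a plurality of information processing devices 12.
An overview of the analysis processing performed by the information processing device 12 is described below with reference to FIGS. 3 to 5.
Next, a configuration example of the information processing device 12 of FIG. 2 is described with reference to FIG. 7.
The communication unit 101 is realized by, for example, a NIC (Network Interface Card). The communication unit 101 is connected to the network N by wire or wirelessly, and transmits and receives information to and from other information processing devices such as the terminal device 11 and external servers.
The storage unit 102 is realized by, for example, a semiconductor memory element such as a RAM (Random Access Memory) or a flash memory, or a storage device such as a hard disk or an optical disk. As shown in FIG. 7, the storage unit 102 includes a past case storage unit 121, a user data storage unit 122, and a user profile storage unit 123. Although not illustrated, the storage unit 102 may also store various kinds of information, such as the source images for images provided to the terminal device 11.
The past case storage unit 121 stores past cases. A past case includes information on predictive analysis performed in the past. The past case storage unit 121 stores, for example, cases in which predictive analysis was introduced into business in the past. The past cases need not be held by the information processing device 12 and may instead be acquired from an external server or the like as appropriate.
Returning to FIG. 7, the user data storage unit 122 is described. User data is various data created or collected by the user. A wide variety of data formats are assumed for user data, for example, those listed below.
・Media: RGB images, depth images, vector images, video, audio, etc.
・Compound documents: office documents, PDF, web pages, e-mail, etc.
・Sensor data: current position, acceleration, heart rate, etc.
・Application data: startup logs, information on files being processed, etc.
・Databases: relational databases, key-value stores, etc.
Next, the user profile storage unit 123 is described. The user profile storage unit 123 stores profile information about the user. The profile information includes, for example, user information and user case information.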
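The control unit 103 is realized by, for example, a CPU (Central Processing Unit) or an MPU (Micro Processing Unit) executing a program stored inside the information processing device 12 (for example, the program according to the present disclosure) using a RAM or the like as a work area. The control unit 103 is also a controller and is realized by, for example, an integrated circuit such as an ASIC (Application Specific Integrated Circuit) or an FPGA (Field Programmable Gate Array).
The acquisition unit 141 acquires various kinds of information from the storage unit 102. For example, the acquisition unit 141 acquires a plurality of past cases from the past case storage unit 121, user data from the user data storage unit 122, and profile information from the user profile storage unit 123. The acquisition unit 141 may acquire this information from an external server, the terminal device 11, or the like instead of from the past case storage unit 121, the user data storage unit 122, and the user profile storage unit 123.
The time prediction unit 151 predicts the time required for the analysis processing performed by the control unit 103, from when the acquisition unit 141 starts acquiring data until processing results such as the extracted problem settings are presented to the user.
The interpretation unit 152 analyzes and structures the user data that the acquisition unit 141 acquired from the user data storage unit 122. First, the data analysis performed by the interpretation unit 152 is described.
Next, the extraction unit 153 extracts problem settings for predictive analysis based on the user data structured by the interpretation unit 152 (hereinafter also referred to as structured data) and the past cases acquired by the acquisition unit 141. A problem setting includes a plurality of "used items" (explanatory variables), which specify which data items to use, and one "prediction target" (objective variable), which specifies what to predict.
When the constructed data set built by the extraction unit 153 is a time-series data set, the time-series processing unit 154 resamples the dates and times of the time-series data set. For example, the time-series processing unit 154 corrects the sampling interval (time interval) of the time-series data set and interpolates missing values of the prediction target.
The leak detection and correction unit 155 detects the possibility of a leak in the constructed data set based on the properties of the constructed data set, the past cases stored in the past case storage unit 121, and the like. The leak detection and correction unit 155 also corrects the constructed data set as necessary, based on the detected possibility of a leak.
The learning unit 156 trains a prediction model based on the problem setting extracted by the extraction unit 153 and on either the constructed data set from the extraction unit 153 or the constructed data set corrected by the leak detection and correction unit 155. When the extraction unit 153 extracts a plurality of problem settings, the learning unit 156 trains a prediction model for each of them.
The evaluation unit 157 evaluates the prediction model generated by the learning unit 156. When the learning unit 156 generates a plurality of prediction models, the evaluation unit 157 evaluates each of them.
The prediction unit 158 predicts the business effect of introducing the prediction model into business. When the learning unit 156 generates a plurality of prediction models, the prediction unit 158 predicts the business effect (hereinafter also referred to as the predicted effect) of introducing each of them.
The collection determination unit 159 determines, for each extracted problem setting, data items to propose that the user collect (hereinafter also referred to as proposed items), based on the past cases and the user data. When there are a plurality of problem settings, the collection determination unit 159 determines proposed items for each of them. The collection determination unit 159 may determine a plurality of proposed items for a single problem setting.
The contribution calculation unit 160 calculates a contribution indicating which features of the test data input to the prediction model trained by the learning unit 156 contribute to the prediction result, and by how much. Specifically, the contribution calculation unit 160 removes the feature whose contribution is to be calculated from the input of the prediction model, and calculates the contribution based on the change in the prediction result before and after the removal.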
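The contribution calculation described above can be sketched as follows. This is only a minimal illustration, assuming the prediction model is available as a plain Python callable and that "removing" a feature is approximated by replacing it with a baseline value; the function name `ablation_contributions`, the baseline strategy, and the toy linear model are hypothetical and are not specified in the disclosure.

```python
from typing import Callable, Dict

Features = Dict[str, float]


def ablation_contributions(
    predict: Callable[[Features], float],
    sample: Features,
    baseline: Features,
) -> Dict[str, float]:
    """Contribution of each feature = change in prediction when it is removed.

    Removal is approximated by swapping the feature for a baseline value
    (an assumption; the text only says the feature is removed from the input).
    """
    full = predict(sample)
    contributions = {}
    for name in sample:
        ablated = dict(sample)
        ablated[name] = baseline[name]
        contributions[name] = full - predict(ablated)
    return contributions


# Hypothetical stand-in for a trained prediction model.
def toy_model(x: Features) -> float:
    return 3.0 * x["temperature"] + 0.5 * x["is_holiday"]


sample = {"temperature": 2.0, "is_holiday": 1.0}
baseline = {"temperature": 0.0, "is_holiday": 0.0}
scores = ablation_contributions(toy_model, sample, baseline)
```

A feature with a larger absolute score changes the prediction more when it is taken away, which is the sense of "contribution" used in the paragraph above.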
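The display control unit 143 controls the display of various kinds of information, including the display of such information on the terminal device 11. The display control unit 143 generates an image that includes control information for controlling the display mode. This control information is written in, for example, a script language such as JavaScript (registered trademark), or in CSS. By providing an image containing such control information to the terminal device 11, the display control unit 143 causes the terminal device 11 to perform the display processing described above in accordance with the control information. The display control unit 143 is not limited to the above and may control the display of the terminal device 11 using various conventional techniques as appropriate.
FIG. 9 shows a configuration example of the time-series processing unit 154 of FIG. 7.
When the constructed data set contains time-series data sets of a plurality of different series, the separation unit 201 separates the time-series data sets by series. For example, the separation unit 201 separates the constructed data set into time-series data sets of a plurality of series based on the degree to which values of an item related to date and time in the constructed data set (hereinafter referred to as a date-and-time item) are duplicated.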
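One way to picture separation by the degree of duplicated date-and-time values is the sketch below. It is an illustration under stated assumptions, not the procedure of the disclosure: the duplication measure, the 0.3 threshold, the rule of choosing the first candidate column that makes date-times unique within each group, and all names (`duplication_ratio`, `find_series_key`, `separate`) are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional

Row = Dict[str, str]


def duplication_ratio(values: List[str]) -> float:
    """Fraction of values that repeat an earlier value (0.0 = all unique)."""
    if not values:
        return 0.0
    return 1.0 - len(set(values)) / len(values)


def find_series_key(rows: List[Row], datetime_col: str,
                    candidates: List[str]) -> Optional[str]:
    """Return the first candidate column whose groups have unique date-times."""
    for col in candidates:
        groups = defaultdict(list)
        for row in rows:
            groups[row[col]].append(row[datetime_col])
        if all(duplication_ratio(v) == 0.0 for v in groups.values()):
            return col
    return None


def separate(rows: List[Row], datetime_col: str, candidates: List[str],
             threshold: float = 0.3) -> Dict[str, List[Row]]:
    """Split rows into one list per series when date-times overlap heavily."""
    if duplication_ratio([r[datetime_col] for r in rows]) < threshold:
        return {"all": rows}  # few duplicates: treat as a single series
    key = find_series_key(rows, datetime_col, candidates)
    if key is None:
        return {"all": rows}
    series = defaultdict(list)
    for row in rows:
        series[row[key]].append(row)
    return dict(series)


rows = [
    {"date": "2020-01-01", "store": "A", "sales": "10"},
    {"date": "2020-01-01", "store": "B", "sales": "7"},
    {"date": "2020-01-02", "store": "A", "sales": "12"},
    {"date": "2020-01-02", "store": "B", "sales": "9"},
]
by_series = separate(rows, "date", ["store"])
```

Here every date appears twice, so the data set is treated as two interleaved series and split on the hypothetical `store` column.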
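When the constructed data set is a time-series data set, the resampling unit 202 resamples the dates and times of the time-series data set. When the constructed data set contains a plurality of time-series data sets, the resampling unit 202 resamples the dates and times of the time-series data set for each series.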
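The resampling of dates and times can be illustrated with a short sketch. It assumes that the target interval is taken to be the most frequent gap between consecutive timestamps and that missing slots are filled by linear interpolation; both choices, and the names `infer_interval` and `resample`, are assumptions made for illustration. The input needs at least two points.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import List, Tuple

Point = Tuple[datetime, float]


def infer_interval(points: List[Point]) -> timedelta:
    """Use the most frequent gap between consecutive timestamps."""
    gaps = [b[0] - a[0] for a, b in zip(points, points[1:])]
    return Counter(gaps).most_common(1)[0][0]


def resample(points: List[Point]) -> List[Point]:
    """Place values on a regular grid, linearly interpolating missing slots."""
    points = sorted(points)
    step = infer_interval(points)
    out: List[Point] = []
    idx = 0
    t = points[0][0]
    while t <= points[-1][0]:
        # Advance so that points[idx] is the last observation at or before t.
        while idx + 1 < len(points) and points[idx + 1][0] <= t:
            idx += 1
        t0, v0 = points[idx]
        if t0 == t or idx + 1 == len(points):
            out.append((t, v0))
        else:
            t1, v1 = points[idx + 1]
            weight = (t - t0) / (t1 - t0)
            out.append((t, v0 + weight * (v1 - v0)))
        t += step
    return out


points = [
    (datetime(2020, 1, 1), 10.0),
    (datetime(2020, 1, 2), 12.0),
    (datetime(2020, 1, 3), 14.0),
    (datetime(2020, 1, 5), 18.0),  # January 4 is missing
]
regular = resample(points)
```

In this toy input the inferred interval is one day, so a value for January 4 is interpolated between the observations on January 3 and January 5.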
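When the constructed data set is a time-series data set, the change point detection unit 203 detects change points at which the trend of the values of the prediction target changes significantly. The change point detection unit 203 also divides the constructed data set based on the detected change points.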
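A minimal sketch of change point detection and the subsequent division is shown below. The disclosure does not state which detection method is used; this example assumes a simple mean-shift criterion (the split that maximizes the difference between the mean before and after it), and the names, `min_size`, and `threshold` are hypothetical.

```python
from statistics import mean
from typing import List, Optional


def detect_change_point(values: List[float], min_size: int = 3,
                        threshold: float = 1.0) -> Optional[int]:
    """Return the split index with the largest mean shift, or None.

    A split is reported only if the absolute difference between the mean
    of the left part and the mean of the right part exceeds `threshold`.
    """
    best_idx, best_gap = None, threshold
    for i in range(min_size, len(values) - min_size + 1):
        gap = abs(mean(values[:i]) - mean(values[i:]))
        if gap > best_gap:
            best_idx, best_gap = i, gap
    return best_idx


def split_at(values: List[float], idx: Optional[int]) -> List[List[float]]:
    """Divide the series at the detected change point, if any."""
    return [values] if idx is None else [values[:idx], values[idx:]]


series = [10, 11, 10, 11, 10, 30, 31, 30, 31, 30]
change_idx = detect_change_point(series)
segments = split_at(series, change_idx)
```

After the split, a model could be trained on the most recent segment only, so that values from before the trend change do not dominate training.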
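FIG. 10 shows a configuration example of the leak detection and correction unit 155 of FIG. 7.
The occurrence date and time specifying unit 221 specifies the occurrence date and time of each piece of data included in the constructed data set. The occurrence date and time indicates, for example, the date and time at which the event related to the data actually occurred. It is expressed by, for example, one or more of the year, month, day, and time.
The acquisition date and time specifying unit 222 specifies the acquisition date and time of each piece of data included in the constructed data set. The acquisition date and time indicates, for example, the date and time at which the user actually became able to acquire the data, or the date and time at which the user actually acquired it. It is expressed by, for example, one or more of the year, month, day, and time.
The prediction execution timing setting unit 223 sets the timing at which the prediction target is predicted (hereinafter referred to as the prediction execution timing), based on the properties of the prediction target, the past cases stored in the past case storage unit 121, and the like.
The leak detection unit 224 detects the possibility of a leak in the constructed data set based on the properties of the constructed data set, the past cases stored in the past case storage unit 121, and the like.
The leak correction unit 225 corrects the constructed data set so as to eliminate the possibility of a leak detected by the leak detection unit 224, based on, for example, instructions from the user input via the terminal device 11, the past cases stored in the past case storage unit 121, and the profile information stored in the user profile storage unit 123.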
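The relationship between the occurrence or acquisition date and time and the prediction execution timing can be sketched as a simple comparison: an item that only becomes known after the prediction is made is flagged, and flagged items can then be removed. The item names, the dates, and the functions `find_timing_leaks` and `drop_items` are hypothetical; the example reuses the bento-sales scenario from the background, where whether the product sold out is not known until the day itself.

```python
from datetime import datetime
from typing import Dict, List


def find_timing_leaks(item_available_at: Dict[str, datetime],
                      prediction_time: datetime) -> List[str]:
    """Return items that only become known after the prediction is made."""
    return sorted(name for name, known_at in item_available_at.items()
                  if known_at > prediction_time)


def drop_items(rows: List[dict], leaky: List[str]) -> List[dict]:
    """Correct the data set by removing the flagged items from every row."""
    return [{k: v for k, v in row.items() if k not in leaky} for row in rows]


# Predicting sales for January 10; the prediction is run on January 7.
prediction_time = datetime(2020, 1, 7)
item_available_at = {
    "weather_forecast": datetime(2020, 1, 6),
    "day_of_week": datetime(2019, 12, 1),
    "sold_out_flag": datetime(2020, 1, 10, 18, 0),
}
leaky_items = find_timing_leaks(item_available_at, prediction_time)

rows = [{"weather_forecast": "rain", "day_of_week": "Fri", "sold_out_flag": 1}]
corrected = drop_items(rows, leaky_items)
```

In practice the flagged items would be shown to the user together with the reason before any correction is applied, since whether an item is truly unavailable at prediction time depends on the user's workflow.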
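Next, the procedure of information processing according to the embodiment is described with reference to the flowchart of FIG. 11.
・The item name contains the character string "ID".
・The item is placed in the first column of the data set or in a column near the first.
・The item is not a date-and-time item.
・The data is represented by character strings or integers.
・The data has no unit.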
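The five conditions listed above can be checked mechanically, as in the sketch below. How the conditions are combined is not stated in this passage, so the example simply counts how many are satisfied; the function name `id_likeness`, the `near_front` cutoff, and the regular expression used to guess whether a value carries a unit are all assumptions.

```python
import re
from typing import Sequence

# A number followed by non-digit characters, e.g. "12 kg" or "35%".
_UNIT_PATTERN = re.compile(r"-?\d+(?:\.\d+)?\s*[^\d\s]+")


def id_likeness(name: str, position: int, values: Sequence[object],
                is_datetime: bool, near_front: int = 2) -> int:
    """Count how many of the five listed conditions an item satisfies."""
    checks = [
        "ID" in name,                       # name contains the string "ID"
        position < near_front,              # at or near the first column
        not is_datetime,                    # not a date-and-time item
        all(isinstance(v, (str, int)) and not isinstance(v, bool)
            for v in values),               # strings or integers only
        not any(isinstance(v, str) and _UNIT_PATTERN.fullmatch(v)
                for v in values),           # no value carries a unit
    ]
    return sum(checks)


score_user = id_likeness("userID", 0, ["A001", "A002"], is_datetime=False)
score_weight = id_likeness("weight", 4, ["12 kg", "15 kg"], is_datetime=False)
```

An item with a high count is a candidate identification item, whose correlation with the prediction target could then be examined as a possible leak.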
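The configurations described above are examples, and the information processing system 1 may have any system configuration as long as it can extract problem settings and construct data sets based on past cases and user data. For example, the information processing device 12 and the terminal device 11 may be integrated.
Information equipment such as the information processing device 12 and the terminal device 11 according to the embodiments and modifications described above is realized by, for example, a computer 1000 configured as shown in FIG. 21. FIG. 21 is a hardware configuration diagram showing an example of the computer 1000 that realizes the functions of information processing devices such as the information processing device 12 and the terminal device 11. The information processing device 12 according to the embodiment is described below as an example. The computer 1000 has a CPU 1100, a RAM 1200, a ROM (Read Only Memory) 1300, an HDD (Hard Disk Drive) 1400, a communication interface 1500, and an input/output interface 1600. The parts of the computer 1000 are connected by a bus 1050.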
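(1)
An information processing device including a leak detection unit that detects the possibility of a leak in a data set based on properties of the data set, the data set being used for training a prediction model used for predictive analysis.
(2)
The information processing device according to (1) above, wherein the leak detection unit detects the possibility of a leak in the data set based on the relationship between the occurrence date and time or the acquisition date and time of data included in the data set and the timing at which the predictive analysis is performed.
(3)
The information processing device according to (2) above, wherein the leak detection unit determines that data whose occurrence date and time or acquisition date and time is later than the timing at which the predictive analysis is performed may be a leak.
(4)
The information processing device according to any one of (1) to (3) above, wherein the leak detection unit detects the possibility of a leak in the data set based on the correlation between an identification item of the data set, which is an item capable of identifying data, and the prediction target of the predictive analysis.
(5)
The information processing device according to (4) above, wherein the leak detection unit determines that the identification item may be a leak when there is a correlation between the identification item and the prediction target.
(6)
The information processing device according to any one of (1) to (5) above, wherein the leak detection unit divides the data set into a plurality of groups according to the values of at least one item of the data set, and detects the possibility of a leak in the data of that item based on the result of performing group K-fold cross-validation.
(7)
The information processing device according to (6) above, wherein the leak detection unit detects the possibility of a leak in the data of the item based on the difference in prediction accuracy between the groups.
(8)
The information processing device according to any one of (1) to (7) above, wherein the leak detection unit divides the data set into training data and test data in a plurality of different patterns while preserving the order of the data set, and detects the possibility of a leak in the data set based on the result of verifying the prediction model for each pattern.
(9)
The information processing device according to (8) above, wherein the leak detection unit detects the possibility of a leak in the order of the data set based on the difference in prediction accuracy between the patterns.
(10)
The information processing device according to any one of (1) to (9) above, wherein the leak detection unit detects the possibility of a leak in the data set based on the item names of the data set.
(11)
The information processing device according to any one of (1) to (10) above, wherein the leak detection unit detects the possibility of a leak in the data set based on the domain of the data set.
(12)
The information processing device according to any one of (1) to (11) above, further including a display control unit that controls display of the detection result of the possibility of a leak in the data set.
(13)
The information processing device according to (12) above, wherein the display control unit controls display of the items or data in the data set that may be a leak and of the reason why they may be a leak.
(14)
The information processing device according to any one of (1) to (13) above, further including a leak correction unit that corrects the data set based on the detection result of the possibility of a leak in the data set.
(15)
The information processing device according to (14) above, wherein the leak correction unit removes items that may be a leak from the data set.
(16)
The information processing device according to (14) or (15) above, wherein the leak correction unit personalizes the correction method for the data set based on feedback from the user on the detection result of the possibility of a leak in the data set.
(17)
The information processing device according to any one of (1) to (16) above, further including a separation unit that separates the data set into time-series data sets of a plurality of series based on the degree to which values of an item related to date and time in the data set are duplicated.
(18)
The information processing device according to any one of (1) to (17) above, further including a resampling unit that resamples the data set based on the interval between values of an item related to date and time in the data set.
(19)
An information processing method in which an information processing device detects the possibility of a leak in a data set based on properties of the data set, the data set being used for training a prediction model used for predictive analysis.
(20)
A program for causing a computer to execute processing of detecting the possibility of a leak in a data set based on properties of the data set, the data set being used for training a prediction model used for predictive analysis.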
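Configurations (6) and (7) above rely on group K-fold cross-validation, in which all rows sharing a value of the chosen item are kept on the same side of every split. The sketch below illustrates the idea with the standard library only. The stand-in model (predicting the training mean), the use of mean absolute error, and the decision rule in `leak_suspected` with its `ratio` parameter are assumptions for illustration and are not taken from the disclosure.

```python
from statistics import mean
from typing import Hashable, Iterator, List, Sequence, Tuple


def group_k_fold(groups: Sequence[Hashable],
                 k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx) so that no group spans both sides."""
    unique = sorted(set(groups), key=str)
    for fold in range(k):
        held_out = set(unique[fold::k])
        test = [i for i, g in enumerate(groups) if g in held_out]
        train = [i for i, g in enumerate(groups) if g not in held_out]
        yield train, test


def fold_errors(targets: Sequence[float], groups: Sequence[Hashable],
                k: int) -> List[float]:
    """Mean absolute error per fold for a stand-in mean-predictor model."""
    errors = []
    for train, test in group_k_fold(groups, k):
        prediction = mean(targets[i] for i in train)
        errors.append(mean(abs(targets[i] - prediction) for i in test))
    return errors


def leak_suspected(errors: Sequence[float], ratio: float = 1.5) -> bool:
    """Flag the item when the worst fold is much worse than the best one."""
    return max(errors) > ratio * max(min(errors), 1e-9)


targets = [10, 11, 10, 11, 50, 51]
groups = ["a", "a", "b", "b", "c", "c"]  # values of the item under test
errors = fold_errors(targets, groups, k=3)
flagged = leak_suspected(errors)
```

In this toy data the fold that holds out group "c" is predicted much worse than the others, so the item that defines the groups is flagged for review. A real check would use the actual prediction model and accuracy metric.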
Claims (20)
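- Claim 1. An information processing device comprising a leak detection unit that detects the possibility of a leak in a data set based on properties of the data set, the data set being used for training a prediction model used for predictive analysis.
- Claim 2. The information processing device according to claim 1, wherein the leak detection unit detects the possibility of a leak in the data set based on the relationship between the occurrence date and time or the acquisition date and time of data included in the data set and the timing at which the predictive analysis is performed.
- Claim 3. The information processing device according to claim 2, wherein the leak detection unit determines that data whose occurrence date and time or acquisition date and time is later than the timing at which the predictive analysis is performed may be a leak.
- Claim 4. The information processing device according to claim 1, wherein the leak detection unit detects the possibility of a leak in the data set based on the correlation between an identification item of the data set, which is an item capable of identifying data, and the prediction target of the predictive analysis.
- Claim 5. The information processing device according to claim 4, wherein the leak detection unit determines that the identification item may be a leak when there is a correlation between the identification item and the prediction target.
- Claim 6. The information processing device according to claim 1, wherein the leak detection unit divides the data set into a plurality of groups according to the values of at least one item of the data set, and detects the possibility of a leak in the data of that item based on the result of performing group K-fold cross-validation.
- Claim 7. The information processing device according to claim 6, wherein the leak detection unit detects the possibility of a leak in the data of the item based on the difference in prediction accuracy between the groups.
- Claim 8. The information processing device according to claim 1, wherein the leak detection unit divides the data set into training data and test data in a plurality of different patterns while preserving the order of the data set, and detects the possibility of a leak in the data set based on the result of verifying the prediction model for each pattern.
- Claim 9. The information processing device according to claim 8, wherein the leak detection unit detects the possibility of a leak in the order of the data set based on the difference in prediction accuracy between the patterns.
- Claim 10. The information processing device according to claim 1, wherein the leak detection unit detects the possibility of a leak in the data set based on the item names of the data set.
- Claim 11. The information processing device according to claim 1, wherein the leak detection unit detects the possibility of a leak in the data set based on the domain of the data set.
- Claim 12. The information processing device according to claim 1, further comprising a display control unit that controls display of the detection result of the possibility of a leak in the data set.
- Claim 13. The information processing device according to claim 12, wherein the display control unit controls display of the items or data in the data set that may be a leak and of the reason why they may be a leak.
- Claim 14. The information processing device according to claim 1, further comprising a leak correction unit that corrects the data set based on the detection result of the possibility of a leak in the data set.
- Claim 15. The information processing device according to claim 14, wherein the leak correction unit removes items that may be a leak from the data set.
- Claim 16. The information processing device according to claim 14, wherein the leak correction unit personalizes the correction method for the data set based on feedback from the user on the detection result of the possibility of a leak in the data set.
- Claim 17. The information processing device according to claim 1, further comprising a separation unit that separates the data set into time-series data sets of a plurality of series based on the degree to which values of an item related to date and time in the data set are duplicated.
- Claim 18. The information processing device according to claim 1, further comprising a resampling unit that resamples the data set based on the interval between values of an item related to date and time in the data set.
- Claim 19. An information processing method in which an information processing device detects the possibility of a leak in a data set based on properties of the data set, the data set being used for training a prediction model used for predictive analysis.
- Claim 20. A program for causing a computer to execute processing of detecting the possibility of a leak in a data set based on properties of the data set, the data set being used for training a prediction model used for predictive analysis.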
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| JP2021570019A JP7639710B2 (ja) | 2020-01-08 | 2020-12-25 | Information processing device, information processing method, and program |
| EP20912940.2A EP4089598A4 (en) | 2020-01-08 | 2020-12-25 | Information processing device, information processing method, and program |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| JP2020001592 | 2020-01-08 | ||
| JP2020-001592 | 2020-01-08 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2021140957A1 (ja) | 2021-07-15 |
Family
ID=76788468
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/JP2020/048727 Ceased WO2021140957A1 (ja) | 2020-01-08 | 2020-12-25 | Information processing device, information processing method, and program |
Country Status (3)
| Country | Link |
|---|---|
| EP (1) | EP4089598A4 (ja) |
| JP (1) | JP7639710B2 (ja) |
| WO (1) | WO2021140957A1 (ja) |
Cited By (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
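| JP2023033927A (ja) * | 2021-08-30 | 2023-03-13 | 株式会社日立製作所 | Water treatment status monitoring system and water treatment status monitoring method |
| JP2024022591A (ja) * | 2022-08-04 | 2024-02-16 | エヌ・ティ・ティ・コミュニケーションズ株式会社 | Information processing device, information processing method, and information processing program |
| JP2025506592A (ja) * | 2023-09-12 | 2025-03-13 | 南京大学 | Method and system for reconstructing a biological wastewater treatment process based on machine learning |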
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2016212642A (ja) * | 2015-05-08 | 2016-12-15 | 富士電機株式会社 | Alarm prediction device, alarm prediction method, and program |
| JP2017016321A (ja) | 2015-06-30 | 2017-01-19 | ソニー株式会社 | Information processing device, information processing method, and program |
| JP2018155522A (ja) * | 2017-03-16 | 2018-10-04 | 株式会社島津製作所 | Data analysis device |
| JP2019149030A (ja) * | 2018-02-27 | 2019-09-05 | 日本電信電話株式会社 | Learning quality estimation device, method, and program |
| JP2020135054A (ja) * | 2019-02-13 | 2020-08-31 | 株式会社キーエンス | Data analysis device and data analysis method |
-
2020
- 2020-12-25 EP EP20912940.2A patent/EP4089598A4/en not_active Withdrawn
- 2020-12-25 JP JP2021570019A patent/JP7639710B2/ja active Active
- 2020-12-25 WO PCT/JP2020/048727 patent/WO2021140957A1/ja not_active Ceased
Non-Patent Citations (1)
| Title |
|---|
| See also references of EP4089598A4 |
Cited By (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
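| JP2023033927A (ja) * | 2021-08-30 | 2023-03-13 | 株式会社日立製作所 | Water treatment status monitoring system and water treatment status monitoring method |
| JP7600057B2 (ja) | 2021-08-30 | 2024-12-16 | 株式会社日立製作所 | Water treatment status monitoring system and water treatment status monitoring method |
| JP2024022591A (ja) * | 2022-08-04 | 2024-02-16 | エヌ・ティ・ティ・コミュニケーションズ株式会社 | Information processing device, information processing method, and information processing program |
| JP7783238B2 (ja) | 2022-08-04 | 2025-12-09 | Nttドコモビジネス株式会社 | Information processing device, information processing method, and information processing program |
| JP2025506592A (ja) * | 2023-09-12 | 2025-03-13 | 南京大学 | Method and system for reconstructing a biological wastewater treatment process based on machine learning |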
Also Published As
| Publication number | Publication date |
|---|---|
| EP4089598A4 (en) | 2023-02-08 |
| JP7639710B2 (ja) | 2025-03-05 |
| JPWO2021140957A1 (ja) | 2021-07-15 |
| EP4089598A1 (en) | 2022-11-16 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 20912940; Country of ref document: EP; Kind code of ref document: A1 |
| | ENP | Entry into the national phase | Ref document number: 2021570019; Country of ref document: JP; Kind code of ref document: A |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | ENP | Entry into the national phase | Ref document number: 2020912940; Country of ref document: EP; Effective date: 20220808 |
| | WWW | Wipo information: withdrawn in national office | Ref document number: 2020912940; Country of ref document: EP |