WO2023096887A1 - Techniques for combined data and execution driven pipeline - Google Patents
- Publication number
- WO2023096887A1 (application PCT/US2022/050675)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- workflow
- processes
- data
- query
- data processing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/34—Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/06—Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
- G06Q10/063—Operations research, analysis or management
- G06Q10/0633—Workflow analysis
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
- G06F9/5038—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration
Definitions
- Data processing workflow systems usually allow a user to manage the workflow, such that the user can set up the data processing workflow and store the workflow for retrieval and/or execution (e.g., via a database).
- To execute a workflow, the user must typically select input data (e.g., from a storage source) and trigger execution of the workflow using the input data. The workflow is then executed to generate output data.
- improved workflow management systems and various applications using the improved workflow management systems can be devised.
- Such improved systems and methods employ computational and AI processes to utilize hardware requirements (in building a workflow) and allow users to control instrumentation and sample tracking in a variety of applications, e.g., in chemistry workflows.
- the systems and methods may store data passed between processes in association with one or more processes in the workflow.
- a data processing workflow and data between the processes in the workflow may be stored in a graph database, as a pipeline execution record.
- the systems and methods enable the user to query the pipeline execution record in any suitable stage in the workflow by the structure of the workflow (e.g., a graph or a subgraph as search query).
- the systems and methods provided herein are particularly advantageous over conventional systems in that data and workflows, which are complex and dynamically changing, may be scaled up.
- Some embodiments are directed to a system for workflow management, the system comprising at least one processor, the at least one processor is configured to: obtain a specification of a data processing workflow comprising a plurality of processes, wherein each process is associated with input data and output data, and each process is further linked to one or more other processes of the workflow; execute one or more processes of the plurality of processes of the workflow to generate, for each of the one or more processes, input data, output data, execution metadata, or some combination thereof; and generate a pipeline execution record, wherein the pipeline execution record comprises, for each of the one or more executed processes, a process data record comprising the associated input data, output data, execution metadata, or some combination thereof.
- Some embodiments are directed to a method for workflow management, the method comprising, using at least one processor: obtaining a specification of a data processing workflow comprising a plurality of processes, wherein each process is associated with input data and output data, and each process is further linked to one or more other processes of the workflow; executing one or more processes of the plurality of processes of the workflow to generate, for each of the one or more processes, input data, output data, execution metadata, or some combination thereof; and generating a pipeline execution record, wherein the pipeline execution record comprises, for each of the one or more executed processes, a process data record comprising the associated input data, output data, execution metadata, or some combination thereof.
- Some embodiments are directed to a non-transitory computer-readable media comprising instructions that, when executed, cause at least one processor to perform operations comprising: obtaining a specification of a data processing workflow comprising a plurality of processes, wherein each process is associated with input data and output data, and each process is further linked to one or more other processes of the workflow; executing one or more processes of the plurality of processes of the workflow to generate, for each of the one or more processes, input data, output data, execution metadata, or some combination thereof; and generating a pipeline execution record, wherein the pipeline execution record comprises, for each of the one or more executed processes, a process data record comprising the associated input data, output data, execution metadata, or some combination thereof.
- FIG. 1A is a diagram of a workflow management system for combined data and execution-driven pipeline, according to some embodiments.
- FIG. 1B illustrates an example of a search query for searching a data pipeline record by defining a graph search query, according to some embodiments.
- FIG. 2A illustrates multiple processes in an example data processing workflow defined by a user, according to some embodiments.
- FIG. 2B illustrates an example pipeline execution record associated with the data processing workflow of FIG. 2A, according to some embodiments.
- FIG. 3A illustrates an example graphical user interface that may be implemented in a workflow management system, according to some embodiments.
- FIG. 3B illustrates an example form builder that may be implemented in a workflow management system, according to some embodiments.
- FIG. 4A illustrates multiple processes in an example map reduction data processing workflow, according to some embodiments.
- FIG. 4B illustrates an example pipeline execution record associated with the data processing workflow of FIG. 4A, according to some embodiments.
- FIG. 5 illustrates another example data processing workflow in which a portion of a data processing workflow includes a sub-workflow that includes one or more processes, according to some embodiments.
- FIG. 6 illustrates a pipeline execution record associated with an example data processing workflow, according to some embodiments.
- FIG. 7 illustrates a pipeline execution record associated with another example data processing workflow combined with machine learning training and prediction, according to some embodiments.
- FIGS. 8A and 8B illustrate example architectures of a system that may implement one or more components of a system for combined data and execution-driven pipeline, according to some embodiments.
- FIG. 9A illustrates an example application of molecule evaluation implemented in a data processing workflow, according to some embodiments.
- FIG. 9B illustrates an example of pipeline execution record in a graph database resulting from an execution of the data processing workflow shown in FIG. 9A, according to some embodiments.
- FIGS. 10A-10B illustrate an example graph database query based in part on a graph database resulting from an execution of a data processing workflow, according to some embodiments.
- FIG. 11 illustrates examples of multiple molecule evaluation workflows in parallel, according to some embodiments.
- FIG. 12 illustrates example process units of an example process in a data processing workflow, according to some embodiments.
- FIG. 13 illustrates an example recursive data processing workflow, according to some embodiments.
- FIG. 14 illustrates an example of reconfiguring a data processing workflow by adding one or more processes, according to some embodiments.
- FIG. 15 illustrates an example application of sample tracking in a laboratory implemented in a data processing workflow, according to some embodiments.
- FIG. 16 shows an illustrative implementation of a computer system that may be used to perform any of the aspects of the techniques and embodiments disclosed herein, according to some embodiments.
- Workflow management systems can be used to perform various tasks.
- a data processing workflow can be created and/or configured in a workflow management system to perform certain functions, such as predicting properties of a molecule.
- a workflow can be executed multiple times, each time using different input data, such as the structure of a different molecule that can be processed to predict the properties of the molecule.
- the input/output data is entirely decoupled from the workflow. While input data is specified for use with a pipeline (e.g., a flow of data through execution of a workflow or a portion of a workflow) in conventional approaches, the input data is not associated with either the pipeline or a particular execution of the pipeline (e.g., including the pipeline configuration for that execution, since the pipeline can change over time). Accordingly, the information for each particular execution of the pipeline is typically lost.
- Although the final output data can be saved, it would need to be manually associated with a particular pipeline, which is typically not done. Even if it is done, the saved data is just the result of the processing pipeline: the output data is not associated with the input data and/or the particular pipeline configuration that was executed to generate it. Further, none of the data/metadata is stored throughout execution of each of the pipeline components in association with the pipeline components themselves. Therefore, conventional systems do not provide a record of a particular pipeline and its associated execution(s). Thus, it is not possible to use conventional approaches to retrieve a particular pipeline configuration or execution of that pipeline to see what data was generated step-by-step in the execution of the pipeline.
- each user is separately responsible for his/her own workflow and data, where no user can share other users’ workflow and/or data thereof, or repeat a portion of other users’ workflow.
- An example data processing workflow may include a plurality of processes that are linked in certain configurations. Each of the one or more processes may be associated with respective input data and output data, and the plurality of processes may be linked, such that output data of one process may be provided as input to one or more other processes in the workflow.
- When the data processing workflow is executed, the data flows through the plurality of processes in the workflow as configured via the associated links.
- a specification of a data processing workflow may include data describing the configuration of the plurality of processes in the data processing workflow, such as how the plurality of processes are linked.
- the plurality of processes in a data processing workflow may be linked in serial, in parallel, or in a combination of serial and parallel manners.
- a specification of a workflow may be represented in a digital representation, such as a specification file describing the workflow.
- a specification file can be an XML file, a JSON file, a graph file, a flat file, and/or any other suitable format.
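- As an illustration of a specification file that describes how processes are linked, the following minimal sketch parses a JSON specification and derives an order in which linked processes could run. The JSON layout (`processes`, `links`, `from`, `to`), the process names, and the helper functions are assumptions made for this example; the source does not define a schema.

```python
import json
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical JSON layout; the source does not define a specification schema.
SPEC_JSON = """
{
  "processes": ["forward_library_generation", "ml_inference_1", "ml_inference_2"],
  "links": [
    {"from": "forward_library_generation", "to": "ml_inference_1"},
    {"from": "forward_library_generation", "to": "ml_inference_2"}
  ]
}
"""


def load_specification(text):
    """Parse a specification into a map of process name -> upstream process names."""
    spec = json.loads(text)
    upstream = {name: set() for name in spec["processes"]}
    for link in spec["links"]:
        upstream[link["to"]].add(link["from"])
    return upstream


def execution_order(upstream):
    """Return an order in which each process comes after the processes feeding it."""
    return list(TopologicalSorter(upstream).static_order())


upstream = load_specification(SPEC_JSON)
order = execution_order(upstream)
print(order)
```

- Here the two inference processes are linked in parallel behind a single upstream process, so either may follow it in the resulting order.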
- the techniques described herein can therefore execute one or more processes of the plurality of processes of a data processing workflow, for example, based on the specification of the workflow.
- the techniques may create a pipeline execution record associated with executing a pipeline (e.g., flow of data through execution of one or more processes in the data processing workflow).
- the pipeline execution record may include, for each of the one or more executed processes, a process data record comprising the associated input data, output data, execution metadata, or some combination thereof.
- the pipeline execution record contains data that records one or more instances of execution of a data processing workflow (or a portion thereof) and input/output data or other execution metadata associated with the execution(s).
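- One way to picture a pipeline execution record is as a list with one process data record per executed process. The sketch below is a simplified illustration under stated assumptions, not the disclosed implementation: the function `run_workflow`, the record field names, and the toy processes are all hypothetical.

```python
import time


def run_workflow(processes, links, initial_inputs):
    """Run processes in the given order and record data for each one.

    processes: ordered mapping of process name -> callable taking a list of inputs
    links: mapping of process name -> list of upstream process names
    initial_inputs: mapping of process name -> list of externally supplied inputs
    """
    outputs = {}
    record = []  # pipeline execution record: one entry per executed process
    for name, func in processes.items():
        inputs = list(initial_inputs.get(name, []))
        inputs += [outputs[up] for up in links.get(name, [])]
        started = time.time()
        outputs[name] = func(inputs)
        record.append({
            "process": name,
            "input": inputs,
            "output": outputs[name],
            "metadata": {"started": started, "duration_s": time.time() - started},
        })
    return record


# Toy processes standing in for real workflow steps (illustrative only).
processes = {
    "generate": lambda xs: [x + "-product" for x in xs],
    "score": lambda xs: {p: len(p) for p in xs[0]},
}
links = {"score": ["generate"]}
record = run_workflow(processes, links, {"generate": ["A", "B"]})
```

- Because intermediate data is kept alongside each process, the output of `generate` appears both as its own output and as the recorded input of `score`.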
- a pipeline execution record may be stored in a database in any suitable format.
- the pipeline execution record may be represented in a graph and stored in a graph database.
- the techniques can also enable a user (who created the pipeline execution record), or other users (who did not create the pipeline execution record) to query the pipeline execution record to see how the data flowed through the pipeline, how the pipeline was executed, or the configuration of the pipeline for a particular execution.
- the techniques enable a user to search the pipeline execution record using a workflow query for one or more processes of a data processing workflow that match the workflow query.
- a workflow query may be represented in a sub-graph representing one or more processes of a portion of the graph database.
- one or more processes of the data processing workflow in the pipeline execution record may be matched to the workflow query.
- the techniques may output data associated with the one or more processes of the data processing workflow that match the workflow query, without re-executing the one or more processes of the data processing workflow.
- the techniques may execute one or more processes of the data processing workflow that are represented in the workflow query to generate the new output data.
- the techniques may display the pipeline execution record, for example, in a graph representation.
- the techniques may provide a user interface (e.g., graphical user interface) that receives user selection(s) defining at least a portion of the graph as the workflow query.
- the techniques may also provide a user interface that enables the user to define a data processing workflow.
- the techniques may provide a graphical user interface, and receive, via the graphical user interface, user selection(s) defining the one or more processes of the plurality of processes.
- the user selection(s) may include selection(s) of one or more processes from a library of user selectable processes.
- the resulting data processing workflow defined by the user may be stored in a specification such as a specification file described herein above.
- FIG. 1A is a diagram of a system for combined data and execution-driven pipeline, according to some embodiments.
- workflow management system 100 may include a workflow builder configured to obtain a specification of a data processing workflow, where the workflow may include one or more processes of a plurality of processes.
- the workflow builder may include a user interface configured to receive user selection(s) defining the one or more processes in the data processing workflow.
- each of the one or more processes may be associated with respective input data and output data, and the one or more processes may be linked, where output data of one process may be provided as input to one or more other processes in the workflow.
- FIG. 2A illustrates multiple processes in an example data processing workflow defined by a user, according to some embodiments.
- a workflow 200 may include three processes 202, 204, 206 (referred to as KUnits in this example without intending to be limiting).
- workflow 200 may be implemented in the workflow management system 100 (FIG. 1A).
- process 202 may include a Forward library generation (Reaction Sage) that takes a list of reactants as input data and generates a list of products as output data.
- the Forward library generation process is linked to a first machine learning (ML) inference process 204 and a second ML inference process 206, where each of the ML inference processes 204, 206 takes the output of the Forward library generation process as input data and generates a respective output, which includes the properties of the products.
- FIG. 2B illustrates an example pipeline execution record 240 associated with the data processing workflow of FIG. 2A, according to some embodiments.
- each of the processes in the data processing workflow may be executed and generate intermediate data (to be provided as input data for other processes) or output data for the workflow.
- system 100 may execute one or more processes of the plurality of processes of the workflow to generate, for each of the one or more processes, input data, output data, execution metadata, or some combination thereof.
- the system may generate a pipeline execution record, wherein the pipeline execution record comprises, for each of the one or more executed processes, a process data record comprising the associated input data, output data, execution metadata, or some combination thereof.
- the system may generate the pipeline execution record 240 that includes the input data and output data of each of the processes of the data processing workflow 200, e.g., Forward library generation process 202, the first ML inference process 204 and the second ML inference process 206.
- the associated data in the pipeline execution record 240 may include the input data to the forward library generation process 202, the output data of the first ML inference process 204 and the second ML inference process 206, and any intermediate data.
- the intermediate data may include the output data of the forward library generation process 202, which is input to the first ML inference process 204 and the second ML inference process 206.
- the pipeline execution record may be represented in a graph representation, where the input/output data associated with each process in the data processing workflow may be represented by a respective node, and where each process in the data processing workflow may be represented by a link between the nodes.
- the pipeline execution record may be represented by multiple nodes (shown as datalake folders) and multiple links between the nodes in a graph representation.
- the forward library generation process 202 may be represented by a link between the node for the input data and the node for the output data associated with the forward library generation process.
- the first ML inference process 204 may be represented by a link between the node associated with the output data of the forward library generation process 202 and the node associated with the final output data (e.g., properties of the products) associated with the first ML inference process.
- the second ML inference process 206 may be represented by a link between the node associated with the output data of the forward library generation process 202 and the node associated with the final output data (e.g., properties of the products) associated with the second ML inference process.
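- The graph layout described above, with data held in nodes and each process represented by a link between nodes, can be sketched as follows. The class `ExecutionGraph`, the node identifiers, and the stored values are placeholders invented for illustration; a real deployment would presumably use a graph database rather than in-memory Python objects.

```python
from dataclasses import dataclass, field


@dataclass
class ExecutionGraph:
    nodes: dict = field(default_factory=dict)  # node id -> stored data
    edges: list = field(default_factory=list)  # (source node, process name, target node)

    def add_data(self, node_id, data):
        """Store input, intermediate, or output data as a node."""
        self.nodes[node_id] = data

    def add_process(self, source, process, target):
        """Represent a process as a link from its input node to its output node."""
        self.edges.append((source, process, target))

    def downstream(self, node_id):
        """List the (process, output node) pairs that consume a given node."""
        return [(p, t) for s, p, t in self.edges if s == node_id]


graph = ExecutionGraph()
# Placeholder values; they are not results from the source.
graph.add_data("input_data", ["item-1", "item-2"])
graph.add_data("library_output", ["product-1"])
graph.add_data("properties_1", {"product-1": "placeholder"})
graph.add_data("properties_2", {"product-1": "placeholder"})
graph.add_process("input_data", "forward_library_generation", "library_output")
graph.add_process("library_output", "ml_inference_1", "properties_1")
graph.add_process("library_output", "ml_inference_2", "properties_2")
```

- The intermediate node `library_output` has two outgoing links, mirroring how the output of the forward library generation process feeds both inference processes.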
- the generated pipeline execution record may be stored in a workflow database 104, such as a graph database.
- each data node may store the input/output data associated with a respective process in the data processing workflow.
- each data node may include one or more pointers that reference a data folder (e.g., data lake folder, as shown in FIG. 2B), where the data in the data folder may be downloaded from an external source, which can be from a software-based system (e.g., external database 106) and/or a hardware-based system.
- This representation of the pipeline execution record (e.g., in a graph representation) may enable the system to store and search any defined workflow together with the data associated with the workflow, where the data may result from a previous execution of the workflow using given input data (e.g., user defined input data).
- system 100 may include a workflow search engine configured to receive, from one or more users, a workflow query, and use the workflow query to search the pipeline execution record for one or more processes of the data processing workflow that match the workflow query.
- the system may obtain from the pipeline execution record output data associated with the one or more processes of the data processing workflow that match the workflow query, and return the obtained output data to the user(s). This enables the user to quickly obtain the output data without the system re-executing any part of the data processing workflow.
- the user may identify one or more processes of a workflow the user would like to search.
- the user may use the identified one or more processes as a search query to search the associated pipeline execution record.
- the search query may be in a graph representation, such as a sub-graph shown in FIG. 1B.
- FIG. 1B illustrates an example of a search query for searching a pipeline execution record by defining a graph search query, according to some embodiments.
- the system may display the pipeline execution record 140 available for searching.
- the pipeline execution record 140 may be displayed in a graph representation as described above.
- the system may receive a user selection defining a portion of the pipeline execution record (e.g., sub-graph 150).
- the sub-graph 150 may include one or more processes in the workflow.
- the system may search the pipeline execution record for one or more processes that match the one or more processes in the search query, and obtain the execution information (e.g., input/output and/or execution metadata) for the one or more processes that match the search query.
- the output of the search may return the output data, and/or any intermediate data associated with the one or more processes that match the search query.
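- The search described above can be illustrated with a deliberately simplified matcher that handles only linear sub-graphs (a chain of process names). The functions `match_chain` and `collect_data`, the edge tuples, and the stored values are assumptions for this sketch; general sub-graph matching in a graph database would be more involved.

```python
def match_chain(edges, query):
    """Find every path in `edges` whose process names equal `query`, in order.

    edges: list of (source node, process name, target node)
    query: list of process names describing a linear sub-graph
    """
    if not query:
        return []
    paths = [[e] for e in edges if e[1] == query[0]]
    for name in query[1:]:
        # Extend each partial path with edges that start where the path ends.
        paths = [
            path + [e]
            for path in paths
            for e in edges
            if e[1] == name and e[0] == path[-1][2]
        ]
    return paths


def collect_data(path, node_data):
    """Return stored data for every node touched by a matched path."""
    node_ids = [path[0][0]] + [e[2] for e in path]
    return {n: node_data[n] for n in node_ids}


edges = [
    ("in", "generate", "lib"),
    ("lib", "infer_1", "props_1"),
    ("lib", "infer_2", "props_2"),
]
node_data = {"in": ["a"], "lib": ["b"], "props_1": {"b": 1}, "props_2": {"b": 2}}

matches = match_chain(edges, ["generate", "infer_2"])
result = collect_data(matches[0], node_data)
```

- The result contains the input, the intermediate data, and the output for the matched processes, all read from stored data without running any process again.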
- the system may obtain data (e.g., input/output data, or other metadata) associated with execution of the matched process(es) in the workflow, where the data may be stored in the workflow database (see FIG. 1A).
- the system may obtain new input data for the matched process(es) in the workflow and re-run the matched process(es) to generate new output data.
- the system may receive an input data query in addition to the one or more processes for search in pipeline execution record.
- the system may determine one or more processes in the workflow that match the workflow query.
- the system may determine if the input data query matches input data associated with the one or more processes of the data processing workflow that match the workflow query.
- In response to determining a match of the input data query, the system may obtain the output data by retrieving output data associated with the one or more processes of the data processing workflow in the pipeline execution record that match the workflow query. Additionally, and/or alternatively, in response to determining a non-match of the input data query, the system may execute (re-run) the one or more processes of the data processing workflow that match the workflow query using the input data query, and generate the new output data. The system may obtain the new output data as the output data associated with the one or more processes of the data processing workflow that match the workflow query.
- the system may create a new pipeline execution record, or append to an existing one, so that the record includes the new output data associated with the workflow and the new input data.
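- The retrieve-or-re-run behavior described above resembles a lookup keyed on both the matched process and its input data. The following sketch assumes JSON-serializable inputs and uses a SHA-256 fingerprint to compare them; `input_key`, `query_or_rerun`, the record fields, and the stand-in process `string_lengths` are hypothetical names chosen for this example.

```python
import hashlib
import json


def input_key(input_data):
    """Stable fingerprint of JSON-serializable input data."""
    payload = json.dumps(input_data, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def query_or_rerun(record, process_name, func, input_data):
    """Return (output, reused): stored output on a match, else re-run and append."""
    key = input_key(input_data)
    for entry in record:
        if entry["process"] == process_name and entry["input_key"] == key:
            return entry["output"], True  # match: no re-execution needed
    output = func(input_data)  # non-match: execute with the queried input
    record.append({
        "process": process_name,
        "input_key": key,
        "input": input_data,
        "output": output,
    })
    return output, False


calls = []


def string_lengths(items):
    """Stand-in process that records each real execution."""
    calls.append(items)
    return [len(s) for s in items]


record = []
first = query_or_rerun(record, "lengths", string_lengths, ["CCO"])
second = query_or_rerun(record, "lengths", string_lengths, ["CCO"])
third = query_or_rerun(record, "lengths", string_lengths, ["CCN", "C"])
```

- The second call returns the stored output without executing the process, while the third call, which uses different input, runs the process and appends a new entry to the record.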
- the system may use version control to manage different sets of data associated with a workflow (or a portion of a workflow). It is appreciated that the system may maintain a single pipeline execution record for each workflow, where the pipeline execution record may include multiple data sets each associated with an execution of a workflow (or a portion of a workflow).
- the system may store multiple pipeline execution records associated with a workflow, where each record is associated with an execution of the workflow (or a portion of the workflow).
- FIG. 3A illustrates an example graphical user interface that may be implemented in a workflow builder, according to some embodiments.
- the graphical user interface 300 shown in FIG. 3A may be implemented in the workflow builder shown in FIG. 1A.
- the workflow builder may receive user selection(s) for defining/building a data processing workflow and obtain a specification of the workflow.
- the user selection(s) may include selection(s) of one or more processes in a workflow and selection(s) of input/output data associated with each of the one or more processes. As shown in FIG.
- each of the boxes (302A, 302B, ...) in the user interface 300 of the workflow builder may be a process defining a function or a data block defining the input/output data associated with one or more processes.
- a process box or a data block may be selected by the user from a library of processes or data blocks, such as 320 in FIG. 3A.
- the user interface 300 may enable users to connect the data blocks and the one or more processes to build a data processing workflow.
- the user may select a first data block 302A and connect it to a process box “molecule” (302B), where the first data block 302A may be configured to convert molecule data (e.g., in SMILES format) to an output data representing the molecule that may be provided to the process “molecule” (302B).
- the process “molecule” (302B) may be configured to receive the molecule data as input and generate output data that may be provided to a process “molstate” (302C).
- the process “molstate” 302C may receive the output data from the “molecule” process 302B and generate output data that include molecular states.
- the user may continue selecting additional process(es) and data block(s) to build the workflow.
- the workflow may be executed to generate a series of output data.
- the output data may include energy property, excitation state, and coordinates of the input molecule.
- Although a graphical user interface is shown to enable a user to build a workflow, it is appreciated that other techniques may be possible.
- the system may allow the user to define a data processing workflow using script language, using drag-and-drop operation, via reading from a workflow file, or a combination thereof.
- the system enables flexibility of data types associated with a process in a data processing workflow.
- FIG. 3B illustrates an example form builder 350 that may be implemented in a workflow builder of FIG. 1A, according to some embodiments.
- the system may allow a user to define/edit data formats (schema) that may be associated with a process. This allows flexibility of changing data schema(s) in the future, or support of older schema(s) or migration of older schema(s) to new schema(s).
- The various embodiments described herein may be implemented to build and search data processing workflows in various applications, such as the workflows shown in FIGS. 4A-7.
- In FIGS. 4A-7, the workflows are represented by respective pipeline execution records, which will be described in detail.
- FIG. 4A illustrates multiple processes in an example map reduction data processing workflow 400, according to some embodiments.
- the workflow 400 may be built and executed in workflow management system 100 (FIG. 1A), in some embodiments.
- FIG. 4B illustrates an example pipeline execution record 420 associated with the data processing workflow 400 of FIG. 4A, according to some embodiments.
- FIG. 5 illustrates a pipeline execution record 500 associated with another example process for OLED, in which a portion of a data processing workflow includes a sub-workflow that includes one or more processes, according to some embodiments.
- FIG. 6 illustrates a pipeline execution record 600 associated with an example data processing workflow, according to some embodiments.
- FIG. 7 illustrates a pipeline execution record 700 associated with another example data processing workflow combined with machine learning training and prediction, according to some embodiments.
- data associated with a process in a data processing workflow may include data or datalake.
- data may include data itself or a pointer to a memory or external data source that stores the data.
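- The distinction between data held directly and data referenced through a pointer can be sketched with a small wrapper that resolves either form. `DataRef`, its fields, and the use of a local temporary file as the "external source" are assumptions for illustration only.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Optional


@dataclass
class DataRef:
    value: Any = None                # the data itself, when stored inline
    location: Optional[Path] = None  # hypothetical pointer to externally stored data

    def resolve(self):
        """Return the data, reading it from the pointed-to location if needed."""
        if self.location is not None:
            return self.location.read_text()
        return self.value


# Inline data versus a pointer to a file standing in for an external source.
inline = DataRef(value="CCO")
with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "molecule.smi"
    path.write_text("CCO")
    pointer = DataRef(location=path)
    resolved_from_pointer = pointer.resolve()
resolved_inline = inline.resolve()
```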
- the datalake is an abstracted object for data storage that supports raw, native, or processed files (e.g., S3, Azure, Google Storage).
- the datalake itself may include metadata that allows the data to be searched (to a certain extent) without other components.
- a datalake may be available from a data storage device and/or platform (e.g., cloud storage) and can be downloadable locally (e.g., for faster execution).
- a process in a data processing workflow may be any of a machine learning process (e.g., machine learning training, machine learning inference), a molecule object creation process (e.g., having SMILES as input data and molecule object as output data), a molecule state process that defines an electronic state of the molecule that is needed for quantum chemistry processes (e.g., having molecule object as input data and molecule state as output data), a molecule-to-conformer process that will calculate 3D coordinates for a number of the lowest conformers of the input molecule (e.g., having molecule as input data and coordinates as output data), or a geometry optimization using quantum chemistry density functional theory calculation (OPT-DFT) process (e.g., having coordinate and molecule state as input data, where the output data may include energy data, electronic data, coordinate data, or OPT-DFT datalake data for raw/unparsed outputs from the process).
- a process in a data processing workflow may also be a single point time dependent density functional theory (SP-TDDFT) quantum chemistry calculation to predict molecular electronic excited states (having coordinate and molecule state as input data, where the output data may have energy data, electronic data, excitation data, and/or SP-TDDFT datalake data).
- a process in a data processing workflow may also include a geometry optimization process (having coordinates as input data, where the output data may also include coordinates), or an excited state calculation process (having coordinates as input data, where the output data include excited states).
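The input and output data types listed above can be written down as simple process declarations. The following is a minimal sketch, assuming hypothetical names (`ProcessSpec`, `missing_inputs`, and the lowercase type labels are illustrative and do not come from the disclosure); it only shows how declared inputs and outputs let a workflow builder check that one process can feed the next.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessSpec:
    """Hypothetical declaration of a process and its data types."""
    name: str
    inputs: frozenset
    outputs: frozenset


# Input/output types follow the examples in the text; the labels are illustrative.
MOLECULE_CREATION = ProcessSpec(
    "molecule_creation", frozenset({"smiles"}), frozenset({"molecule"}))
MOLECULE_STATE = ProcessSpec(
    "molecule_state", frozenset({"molecule"}), frozenset({"molecule_state"}))
MOLECULE_TO_CONFORMER = ProcessSpec(
    "molecule_to_conformer", frozenset({"molecule"}), frozenset({"coordinates"}))
OPT_DFT = ProcessSpec(
    "opt_dft",
    frozenset({"coordinates", "molecule_state"}),
    frozenset({"energy", "electronic", "coordinates", "opt_dft_datalake"}),
)


def missing_inputs(process, available):
    """Return the input types of `process` that are not yet available."""
    return process.inputs - set(available)


# Walk a chain of processes, accumulating the data types produced so far.
available = {"smiles"}
for process in (MOLECULE_CREATION, MOLECULE_STATE, MOLECULE_TO_CONFORMER, OPT_DFT):
    assert not missing_inputs(process, available)
    available |= process.outputs
```

Declaring the types separately from the computation is one possible design; it keeps the connectivity check independent of how each process is actually executed.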
- a process in a data processing workflow may include a combination of multiple processes. For example, as shown in FIG. 5, a process 540 in a pipeline execution record 500 may include multiple processes.
- FIGS. 8A and 8B illustrate example architectures of a system that may implement one or more components of a system for combined data and execution-driven pipeline, according to some embodiments.
- the architecture in FIGS. 8A and 8B may be implemented in a workflow management system, such as 100 in FIG. 1A.
- the architecture may be implemented as SaaS.
- container technologies may be used.
- the workflow management system (e.g., 100 in FIG. 1A) is implemented using event-driven architecture 800, which includes backend components 810.
- all backend components (e.g., 810) are deployed in a container orchestration platform (e.g., Kubernetes Clusters), where a message broker 812 (e.g., RabbitMQ) serves as a communication bus.
- KUnits 814 are executed in containers, and their statuses and lifecycles are managed by “Kloud services” 816, which implement all the logic of KUnit management, access to the database and datalake, scaling of compute resources, role-based control management, error handling, etc.
- services e.g., authentication service 802
- web application frontend (WAF) 840 may communicate with the backend 810 via APIs for front-end interfaces that are provided via an ingress gateway 820 and protected by a Web Application Firewall 842.
- processes in a graph can be executed across different clusters (such as other cloud providers, and on-prem clusters) for users of various clusters to access.
- pipeline execution records may be generated across different clusters and can be shared by the users of other clusters.
- additional clusters (e.g., 810A, 810B) may similarly include their own processes (e.g., KUnits 814A, 814B), deploy their own “Kloud services” (e.g., 816A, 816B), and include their own messaging brokers (e.g., RabbitMQ 812A, 812B).
- Each additional cluster (e.g., 810A, 810B) and the cluster 810 are connected to a shared database cluster and/or datalake 870 via a respective bus (e.g., 880, 880A, 880B).
- a data and workflow manager 850 may be implemented.
- the data and workflow manager may include several components and layers and may be implemented for distributed and heterogeneous workflows that are executed with cloud computing.
- the data and workflow manager 850 may include a central component, such as the operational core 852, which is responsible for global orchestration of all processes and exchange of data.
- the core 852 has extensible interfaces to code access and data access.
- code interfaces 854 may provide a definition where code, workflows, and executables are stored (e.g., Git repositories, Container Repositories, Artifact Repositories).
- the core may deploy and execute specific computing infrastructure (e.g., cloud or on-premise computing).
- Data interfaces 856 provide specifications for where data is stored and translated to be consumed or used by processes (code).
- governance in the data and workflow manager may provide appropriate controls for data, code, and execution.
- all communications between components and services may be encrypted, authenticated, and authorized. These security schemes protect the system against threats that may exist both inside and outside the network so that the processes in the data processing workflow may be executed securely.
- both data and metadata are abstracted.
- the Knowledge Graph Database 858 may be the central data/metadata storage that combines information about data from different sources and the processes generated from this data. It stores the data in a structured format that is suitable to represent the relationships between processes and the data that they generate, along with additional metadata. Additionally, support of other data storage formats like Hadoop or Big Table can be connected via data interfaces 856.
- the main goal of the data and workflow manager is to support extensible abstraction for programming, execution, and query of heterogeneous workflows (mixed computational and experimental) and their associated data.
- the technical challenge of any collaborative platform that is scalable and intended for use by different organizations and people is to support ever-changing data formats (and schema), workflows and tools (e.g., both AI/ML and instrument).
- the system as described herein in various embodiments is designed according to the following assumptions and features: (i) Agility: the data schema is flexible; data and metadata schemas will change in the future; older schemas are supported or migrated to new schemas; end users can define their own schemas; and part of the metadata is collected automatically; (ii) Search and query: search by metadata and data, free-text search, search by knowledge and workflow graphs, and search by different users on the cloud; (iii) Security: data is immutable (append only), data access and processing are audited, data access requires authorization, and data is encrypted; (iv) Reproducibility: the metadata has enough information that the data can be regenerated (with high probability, or with a similar probabilistic distribution for random processes); and (v) Scalability: the system can scale horizontally.
- machine learning, physical modeling, and experimental pipelines share many characteristics.
- the main difference for experimental pipelines is that they need to be synchronized with physical processes and objects in the lab. Therefore, using the same engine to execute and track both kinds of workflows provides the ability to rapidly introduce experimental and hypothesis-based workflows.
- This integration of workflows provides for richer searchability of the data and the construction of hypotheses and models from data generated by different workflows.
- Those workflows can be programmed by end users via simple YAML definition files, DSL (domain specific language), SDKs, and/or other tools.
- the workflows may be tested, executed, and monitored via command line interface (CLI), Jupyter notebooks, web interfaces (such as what is shown in FIG. 7), and/or other suitable tools.
- Various embodiments as described above may store data in a graph database (e.g., workflow database in FIG. 1A), where all data are linked to the processes, if any, that generated and/or used them.
- Data stored in the graph database can be queried based on the workflow graph or its subgraphs as part of the query.
- the workflows may be hierarchical (without limit) and can be queried at different hierarchy levels. Data may be immutable and is not deleted.
- the various embodiments of the system described herein allow “history” or “data trail” queries. Data may be abstracted, and may be indexed data with or without a schema, or references to other data storages.
- Each process in the data processing workflow can auto-scale, can run on different clusters, or can be of any workload.
- workflows can communicate with instruments and users, run GPU or CPU programs, or interface with a developer environment.
- Pipeline execution records can be stored on the cloud and accessed (e.g., searched) by users on different clusters.
- FIGS. 9-15 further illustrate additional example applications of various embodiments of a workflow management system described herein in the present disclosure.
- FIG. 9A illustrates an example of molecule evaluation implemented in a workflow management system (e.g., 100 in FIG. 1A).
- a process 900 may be implemented as a KUnit, and configured to screen molecules for their properties and evaluate how easily molecules can be synthesized. This screening can be applied to the discovery of new materials and drugs.
- Process 900 may include three underlying processes 902, 904, 906. These processes may be serverless processes or shell KUnits.
- process 902 may be a process for property prediction
- process 904 may be a process for retrosynthesis
- process 906 may be a process for scoring.
- the processes “Property prediction” and “Retrosynthesis” may be executed on a GPU instance and may have different (and mutually exclusive) software dependencies.
- each of the processes 902, 904, 906 is shown with its inputs at the top of the rounded rectangle and outputs at the bottom, in some examples, where the arrows show the connectivity between the processes.
- the molecule 901 is sent to the first two processes 902, 904, and their results 902A and 904A are combined in the scoring function 906 to produce the overall score 906A.
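The data flow of FIG. 9A, as described above, can be sketched in a few lines. This is only an illustration of the connectivity: the three function bodies are placeholders invented for the example (they are not real property-prediction, retrosynthesis, or scoring models), and all function names are assumptions.

```python
def property_prediction(molecule: str) -> float:
    """Stand-in for process 902: placeholder score, not a real model."""
    return molecule.count("C") / max(len(molecule), 1)


def retrosynthesis(molecule: str) -> int:
    """Stand-in for process 904: placeholder count of synthesis steps."""
    return 1 + len(molecule) // 4


def scoring(prediction: float, steps: int) -> float:
    """Stand-in for process 906: combine both results into one score."""
    return prediction / steps


def molecule_evaluation(molecule: str) -> float:
    """Graph process 900: fan the molecule out, then combine the results."""
    prediction = property_prediction(molecule)  # result 902A
    steps = retrosynthesis(molecule)            # result 904A
    return scoring(prediction, steps)           # overall score 906A
```

In the described system the first two processes may run on separate GPU instances with incompatible dependencies; the sketch runs them in one interpreter purely to show which outputs feed which inputs.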
- In FIGS. 9-15, the diagrams are shown with small caps names for serverless/shell processes and all-caps names for graph processes.
- FIG. 9B illustrates an example of pipeline execution record in a graph database resulting from an execution of the workflow in FIG. 9A, according to some embodiments.
- the pipeline execution record 950 includes the underlying connectivity of the graph database resulting from a successful execution of a workflow (or a portion thereof).
- the pipeline execution record 950 may be represented in a graph, with the nodes in the graph corresponding to each of the data used and processes run (e.g., molecule 901A, prediction result 902A, retrosynthesis result 904A, processes 902, 904, 906), and the edges in the graph corresponding to associations between the nodes. For example, as shown in FIG.
- edges may be of different types: “input” edges (shown in dashed arrows) indicate the associations from data to process; “output” edges (shown in solid arrows) indicate the associations from process to data; and “contains” edges (shown in dotted arrows) indicate the associations from graphs to their child processes.
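A pipeline execution record with the three edge types can be modeled as a small typed graph. The sketch below is an assumption-laden illustration: the `ExecutionRecord` class and its methods are hypothetical, an in-memory list stands in for the graph database, and the node identifiers simply reuse the reference numerals of the molecule evaluation example.

```python
from dataclasses import dataclass, field

EDGE_TYPES = {"input", "output", "contains"}


@dataclass
class ExecutionRecord:
    """Hypothetical in-memory stand-in for a record kept in a graph database."""
    nodes: dict = field(default_factory=dict)  # node id -> (kind, label)
    edges: list = field(default_factory=list)  # (source, edge type, target)

    def add_node(self, node_id, kind, label):
        self.nodes[node_id] = (kind, label)

    def add_edge(self, source, edge_type, target):
        if edge_type not in EDGE_TYPES:
            raise ValueError(f"unknown edge type: {edge_type}")
        self.edges.append((source, edge_type, target))

    def neighbors(self, source, edge_type):
        return [t for s, e, t in self.edges if s == source and e == edge_type]


record = ExecutionRecord()
record.add_node("900", "graph", "Molecule Evaluation")
for pid, label in [("902", "Property Prediction"), ("904", "Retrosynthesis"),
                   ("906", "Scoring")]:
    record.add_node(pid, "process", label)
    record.add_edge("900", "contains", pid)      # graph -> child process
for did, label in [("901A", "Molecule"), ("902A", "Prediction"),
                   ("904A", "Retrosynthesis Result"), ("906A", "Score")]:
    record.add_node(did, "data", label)
record.add_edge("901A", "input", "902")          # data -> process
record.add_edge("901A", "input", "904")
record.add_edge("902", "output", "902A")         # process -> data
record.add_edge("904", "output", "904A")
record.add_edge("902A", "input", "906")
record.add_edge("904A", "input", "906")
record.add_edge("906", "output", "906A")
```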
- FIGS. 10A-10B illustrate an example graph database query based in part on a graph database resulting from an execution of a workflow such as what is shown in FIG. 9A, according to some embodiments.
- FIG. 10A illustrates a graph database query 1001 based on part of the graph 1050 in FIG. 10B that was used to generate the data.
- FIG. 10B shows a pipeline execution record 1050 resulting from execution of the workflow 900 in FIG. 9A.
- pipeline execution record 1050 is represented in a graph, with the nodes in the graph corresponding to each of the data used and processes run (e.g., molecule 1001A, prediction result 1002A, retrosynthesis result 1004A, processes 1002, 1004, 1006), and the edges in the graph corresponding to associations between the nodes.
- the query is intended to be used to find property predictions and the score of a given molecule.
- a method for searching a pipeline execution record starts with a query input 1000 by searching all molecules that are input to processes named “Property Prediction” (e.g., 1002); this results in finding molecules (e.g., 1001A) as query output 1003.
- the method continues by searching the outputs of that process (e.g., 1002) for output data of type “Prediction” (e.g., 1002A). This results in query output 1005.
- the method for searching continues with searching “Scoring Function” processes (e.g., 1006) which take the given data record “Prediction” (e.g., 1002A) as input, then searching output edges that lead to data type “Score” (e.g., 1006A) as query output 1007.
- the filters at each stage ensure that the query continues to flow through the desired route instead of deviating to any of the other processes connected by the edges. In this example in FIG.
- the search returns the following data records: “Molecule” (e.g., 1003), “Prediction” (e.g., 1005), and “Score” (e.g., 1007), which can be stored, for example, in a table.
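The staged search described above (molecule, then “Property Prediction”, then “Prediction”, then “Scoring Function”, then “Score”) can be sketched as a filtered traversal. This is a minimal sketch under stated assumptions: the `NODES` and `EDGES` structures and the `follow` helper are hypothetical, plain Python collections stand in for the graph database, and the identifiers reuse the reference numerals of FIG. 10B.

```python
# Hypothetical in-memory copy of the record in FIG. 10B.
NODES = {
    "1001A": ("data", "Molecule"),
    "1002": ("process", "Property Prediction"),
    "1004": ("process", "Retrosynthesis"),
    "1006": ("process", "Scoring Function"),
    "1002A": ("data", "Prediction"),
    "1004A": ("data", "Retrosynthesis Result"),
    "1006A": ("data", "Score"),
}
EDGES = [
    ("1001A", "input", "1002"), ("1001A", "input", "1004"),
    ("1002", "output", "1002A"), ("1004", "output", "1004A"),
    ("1002A", "input", "1006"), ("1004A", "input", "1006"),
    ("1006", "output", "1006A"),
]


def follow(source, edge_type, kind, label):
    """Targets reached from `source` over `edge_type` that pass the filter."""
    return [t for s, e, t in EDGES
            if s == source and e == edge_type and NODES[t] == (kind, label)]


def query_predictions_and_scores():
    """Return one table row per matching path through the record."""
    rows = []
    molecules = [n for n, v in NODES.items() if v == ("data", "Molecule")]
    for mol in molecules:
        for proc in follow(mol, "input", "process", "Property Prediction"):
            for pred in follow(proc, "output", "data", "Prediction"):
                for scorer in follow(pred, "input", "process", "Scoring Function"):
                    for score in follow(scorer, "output", "data", "Score"):
                        rows.append({"Molecule": mol, "Prediction": pred,
                                     "Score": score})
    return rows
```

The label filter at each stage is what keeps the traversal off the “Retrosynthesis” branch, even though the molecule is also an input to that process.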
- FIG. 11 illustrates an example of evaluating multiple molecules in parallel, according to some embodiments.
- the various embodiments of the workflow management system (e.g., 100 in FIG. 1A) support a parallelized “map-reduce” framework.
- in a “bulk molecule evaluation” process 1100 (e.g., implemented as a KUnit), the molecules 1100A are split apart, and each “Molecule Evaluation” graph (e.g., 1150A, 1150B, ..., 1150N) runs in parallel, before all of the resulting predictions (e.g., 1100B) are collected together in the workflow management system.
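The split, parallel-run, and collect pattern of the bulk evaluation can be sketched with Python's standard thread pool. This is an illustration only: `evaluate_molecule` is a placeholder invented for the example, and a local `ThreadPoolExecutor` stands in for whatever cluster-level scheduling the workflow management system actually uses.

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate_molecule(molecule: str) -> dict:
    """Placeholder for one "Molecule Evaluation" graph; not a real model."""
    return {"molecule": molecule, "score": len(molecule)}


def bulk_molecule_evaluation(molecules, max_workers=4):
    """Map each molecule to its own evaluation, then collect the results.

    Executor.map returns results in the order of the inputs, so the
    collected predictions line up with the submitted molecules.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(evaluate_molecule, molecules))


predictions = bulk_molecule_evaluation(["CCO", "C", "CCCC"])
```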
- FIG. 12 illustrates example processing units of an example process 1200 in a workflow, according to some embodiments.
- the example process 1200 may be any of the one or more processes (e.g., KUnits) in a data processing workflow such as what is described herein in the present disclosure.
- Individual KUnits can have multiple possible outputs (e.g., 1210, 1212) that are sent via different channels (e.g., 1210A, 1212A) depending on some condition in their computation.
- a serverless KUnit is used to validate user-provided file uploads (e.g., 1202). If the upload (e.g., 1202) is valid, then the output 1210 is a processed version of the uploaded data 1202.
- the output channels (e.g., 1210A, 1212A) can cause other KUnits to run subsequently based on the output of process 1200.
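Conditional output channels can be sketched as a function that returns a channel name together with its payload, plus a dispatcher that triggers only the subscribers of that channel. The names below are assumptions: `validate_upload`, `run`, and the channel labels `"valid"` and `"invalid"` are hypothetical, and the validity rule (non-empty UTF-8 text) is invented for the example.

```python
def validate_upload(upload: bytes):
    """Return (channel, payload); exactly one channel fires per run."""
    try:
        text = upload.decode("utf-8")
    except UnicodeDecodeError as exc:
        return "invalid", str(exc)
    if not text.strip():
        return "invalid", "empty upload"
    # The processed version of the upload: its non-blank content as lines.
    return "valid", text.strip().splitlines()


def run(upload: bytes, subscribers: dict):
    """Trigger only the downstream handlers subscribed to the fired channel."""
    channel, payload = validate_upload(upload)
    return [handler(payload) for handler in subscribers.get(channel, [])]


# Hypothetical downstream processes: count lines, or report the problem.
subscribers = {"valid": [len], "invalid": [lambda reason: f"rejected: {reason}"]}
```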
- FIG. 13 illustrates an example recursive workflow, according to some embodiments.
- the recursive workflow 1300 may be implemented in various embodiments of the workflow management system described herein (e.g., 100 in FIG. 1A).
- workflow 1300 implements the Euclidean algorithm for determining the greatest common divisor of two integers.
- the workflow 1300 has separate processes (e.g., KUnits) for the subroutines of comparing the integers (process 1302) and subtracting the integers (process 1304).
- two integers are compared, and if they are unequal, the difference 1304A between the larger integer 1302B and the smaller integer 1302C is recursively compared with the smaller one 1302C (through connections 1306, 1308). This process is repeated until the two integers (e.g., 1301A, 1301B) are equal, and that number 1302A is the greatest common divisor.
- the recursive workflow includes one or more connections that go backwards from one child to another (or from one child to itself), such as connections 1306, 1308. This causes the “Compare Integers” process 1302 to repeat whenever it is triggered.
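The compare-and-subtract loop described above is the subtraction form of Euclid's algorithm, and it can be sketched with one function per child process. The function names and the channel labels `"equal"` and `"unequal"` are assumptions for illustration; a `while` loop stands in for the backward connections 1306 and 1308.

```python
def compare_integers(a: int, b: int):
    """Stand-in for process 1302: choose an output channel by comparison."""
    if a == b:
        return "equal", (a,)
    return "unequal", (max(a, b), min(a, b))


def subtract_integers(larger: int, smaller: int) -> int:
    """Stand-in for process 1304: the difference fed back to the comparison."""
    return larger - smaller


def gcd_workflow(a: int, b: int) -> int:
    """Repeat compare and subtract until both integers are equal."""
    if a <= 0 or b <= 0:
        raise ValueError("both integers must be positive")
    while True:
        channel, payload = compare_integers(a, b)
        if channel == "equal":
            return payload[0]
        larger, smaller = payload
        # Backward connections: the difference and the smaller integer
        # become the next inputs of the comparison.
        a, b = subtract_integers(larger, smaller), smaller
```

For example, `gcd_workflow(48, 18)` passes through the pairs (30, 18), (12, 18), (6, 12), and (6, 6) before returning 6.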
- FIG. 14 illustrates an example of reconfiguring a workflow by adding one or more processes, according to some embodiments.
- a workflow 1400 may be implemented in various embodiments of the workflow management system described herein (e.g., 100 in FIG. 1A) and can be reconfigured, for example, by adding one or more processes.
- a user interaction process 1402 (e.g., a dedicated KUnit, “K-Interact”) may be added to workflow 1400. Interactions can be simple notifications or can collect structured inputs from users. Some of the input channels (e.g., 1401A, 1401B, 1401C) and output channels (e.g., 1402A) of the interaction process 1402 are shown.
- a “switch” process 1404 may be included that handles the logic of forms 1403A that ask the user to choose from a list of options (e.g., 1404A, 1404B). Additionally, and/or alternatively, a “compose” process 1406 is included that produces the inputs (e.g., 1406A, 1406B, 1406C) to the K-Interact form given input 1405A.
- FIG. 15 illustrates an example of sample tracking in a laboratory implemented in a workflow 1500, according to some embodiments.
- workflow 1500 may be implemented in various embodiments of the workflow management system described herein (e.g., 100 in FIG. 1A).
- the workflow management system as described herein can simplify the implementation of a complex process, such as “transfer sample aliquot.”
- “transfer sample aliquot” can be fairly complex in tracking the origins of each sample: When a portion of the source sample is added to the destination sample, the origins for the new source sample should not include the destination sample, but those for the new destination sample should include both of the original samples.
- the workflow management system as described in the present disclosure can accomplish such a workflow 1500 by utilizing a graph with two children (e.g., 1502, 1504) that produce (e.g., 1502) and consume (e.g., 1504) an ephemeral sample.
- a portion of the source sample 1501A is added to the destination sample 1501B to generate a new destination sample 1504A, which includes both the source sample 1501A and destination sample 1501B, whereas the origins for the new source sample 1502A do not include the destination sample 1501B.
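The origin-tracking rule for “transfer sample aliquot” can be sketched with two small steps that produce and consume an ephemeral aliquot. Everything named here is hypothetical (`Sample`, `take_aliquot`, `add_aliquot`, and the use of a set of origin names); the sketch only demonstrates that the new source keeps its own origins while the new destination inherits both.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """Hypothetical sample record; `origins` names the samples it derives from."""
    name: str
    origins: frozenset


def take_aliquot(source: Sample):
    """Producer child: yields the new source and an ephemeral aliquot."""
    new_source = Sample(source.name, source.origins)
    aliquot = Sample(f"aliquot-of-{source.name}", source.origins)
    return new_source, aliquot


def add_aliquot(aliquot: Sample, destination: Sample) -> Sample:
    """Consumer child: the new destination inherits both sets of origins."""
    return Sample(destination.name, destination.origins | aliquot.origins)


def transfer_sample_aliquot(source: Sample, destination: Sample):
    new_source, aliquot = take_aliquot(source)
    new_destination = add_aliquot(aliquot, destination)
    return new_source, new_destination


source = Sample("source", frozenset({"source"}))
destination = Sample("destination", frozenset({"destination"}))
new_source, new_destination = transfer_sample_aliquot(source, destination)
```

Routing the transfer through an intermediate aliquot is what prevents the destination's origins from leaking back into the source.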
- the systems and methods described in various embodiments of FIGS. 1-15 may provide advantages over other conventional systems to allow a unified system for computational and physical processes.
- the systems and methods described above may allow computational and AI processes to utilize a variety of hardware requirements (in building a workflow) and allow users to control instrumentation and sample tracking.
- data passed between processes is stored in association with one or more processes in the workflow.
- a data processing workflow and data between the processes in the workflow may be stored in a graph database, as a pipeline execution record.
- the pipeline execution record for the workflow allows the user to query the data in any suitable stage in the workflow by the structure of the workflow (e.g., a graph or a sub-graph as search query).
- the systems and methods provided herein are particularly advantageous over conventional systems in that data and chemistry workflows, which are complex and dynamically changing, may be scaled up.
- FIG. 16 shows an illustrative implementation of a computer device that may be used to perform any of the aspects of the techniques and embodiments disclosed herein, according to some embodiments.
- the computer device 1600 may be installed in system 100 of FIG. 1A.
- the computer device 1600 may be configured to perform various methods and acts as described in FIGS. 1A-15.
- the computer device 1600 can implement the workflow management system 100 or any components thereof (e.g., the workflow builder 102, workflow search engine 108) or any device associated with the users as shown in FIG. 1A.
- the computing device 1600 may include one or more computer hardware processors 1602 and non-transitory computer-readable storage media (e.g., memory 1604 and one or more non-volatile storage devices 1606).
- the processor(s) 1602 may control writing data to and reading data from (1) the memory 1604; and (2) the non-volatile storage device(s) 1606. To perform any of the functionality described herein, the processor(s) 1602 may execute one or more processor executable instructions stored in one or more non-transitory computer-readable storage media (e.g., the memory 1604), which may serve as non-transitory computer-readable storage media storing processor-executable instructions for execution by the processor(s) 1602.
- the computing device 1600 also includes network I/O interface(s) 1608 and user I/O interfaces 1610.
- the various methods or processes outlined herein may be coded as software that is executable on one or more processors that employ any one of a variety of operating systems or platforms. Additionally, such software may be written using any of numerous suitable programming languages and/or programming or scripting tools, and also may be compiled as executable machine language code or intermediate code that is executed on a virtual machine or a suitable framework.
- inventive concepts may be embodied as at least one non-transitory computer readable storage medium (e.g., a computer memory, one or more floppy discs, compact discs, optical discs, magnetic tapes, flash memories, circuit configurations in Field Programmable Gate Arrays or other semiconductor devices, etc.) encoded with one or more programs that, when executed on one or more computers or other processors, implement the various embodiments of the present invention.
- the non-transitory computer-readable medium or media may be transportable, such that the program or programs stored thereon may be loaded onto any computer resource to implement various aspects of the present invention as discussed above.
- The terms “program,” “software,” and/or “application” are used herein in a generic sense to refer to any type of computer code or set of computer-executable instructions that can be employed to program a computer or other processor to implement various aspects of embodiments as discussed above. Additionally, it should be appreciated that according to one aspect, one or more computer programs that when executed perform methods of the present invention need not reside on a single computer or processor, but may be distributed in a modular fashion among different computers or processors to implement various aspects of the present invention.
- Computer-executable instructions may be in many forms, such as program modules, executed by one or more computers or other devices.
- program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types.
- functionality of the program modules may be combined or distributed as desired in various embodiments.
- data structures may be stored in non-transitory computer-readable storage media in any suitable form.
- Data structures may have fields that are related through location in the data structure. Such relationships may likewise be achieved by assigning storage for the fields with locations in a non-transitory computer-readable medium that convey relationship between the fields.
- any suitable mechanism may be used to establish relationships among information in fields of a data structure, including through the use of pointers, tags or other mechanisms that establish relationships among data elements.
- Various inventive concepts may be embodied as one or more methods, of which examples have been provided. The acts performed as part of a method may be ordered in any suitable way.
- embodiments may be constructed in which acts are performed in an order different than illustrated, which may include performing some acts simultaneously, even though shown as sequential acts in illustrative embodiments.
- Some embodiments are directed to a system for workflow management, the system comprising at least one processor, the at least one processor is configured to: obtain a specification of a data processing workflow comprising a plurality of processes, wherein each process is associated with input data and output data, and each process is further linked to one or more other processes of the workflow; execute one or more processes of the plurality of processes of the workflow to generate, for each of the one or more processes, input data, output data, execution metadata, or some combination thereof; and generate a pipeline execution record, wherein the pipeline execution record comprises, for each of the one or more executed processes, a process data record comprising the associated input data, output data, execution metadata, or some combination thereof.
- the process data record includes one or more pointers that reference data in one or more external data sources.
- the process data record includes a plurality of datasets each associated with a respective execution of the one or more processes of the plurality of processes of the workflow.
- obtaining the specification of the data processing workflow comprises: receiving, via a graphical user interface, user selection defining the one or more processes of the plurality of processes.
- the user selection includes selection of one or more processes from a library of user selectable processes.
- the specification of the data processing workflow comprises a script file.
- the pipeline execution record is stored in a graph database.
- the at least one processor is further configured to: receive a workflow query; use the workflow query to search the pipeline execution record for one or more processes of the data processing workflow that match the workflow query; and obtain output data associated with the one or more processes of the data processing workflow that match the workflow query.
- the at least one processor is further configured to: receive an input data query; determine if the input data query matches input data associated with the one or more processes of the data processing workflow that match the workflow query; in response to determining a match of the input data query, obtain the output data by retrieving output data associated with the one or more processes of the data processing workflow in the pipeline execution record; and in response to determining a non-match of the input data query: (1) execute the one or more processes of the data processing workflow that match the workflow query to generate new output data; and (2) obtain the new output data as the output data associated with the one or more processes of the data processing workflow that match the workflow query.
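The match / non-match behavior described above resembles a lookup that falls back to execution. The following is a minimal sketch, assuming a plain dictionary in place of the pipeline execution record; `obtain_output` is a hypothetical helper, and writing the new output back into the record is an assumption of the sketch rather than something this passage states.

```python
def obtain_output(record: dict, process_name: str, input_data, processes: dict):
    """Return (output, executed) for the process matched by a workflow query.

    `record` maps (process name, input data) to previously stored output.
    On a match the stored output is retrieved; on a non-match the process
    is executed to generate new output.
    """
    key = (process_name, input_data)
    if key in record:
        return record[key], False
    output = processes[process_name](input_data)
    record[key] = output  # assumption: new results are appended to the record
    return output, True


calls = []


def property_prediction(molecule: str) -> int:
    """Placeholder process that logs each execution."""
    calls.append(molecule)
    return len(molecule)


processes = {"Property Prediction": property_prediction}
record = {("Property Prediction", "CCO"): 3}

hit = obtain_output(record, "Property Prediction", "CCO", processes)
miss = obtain_output(record, "Property Prediction", "CCCC", processes)
```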
- the pipeline execution record is stored in a graph database; and the workflow query comprises a sub-graph.
- the at least one processor is further configured to: display the pipeline execution record in a graph; and receive user selection defining at least a portion of the graph as the workflow query.
- the process data record comprises one or more pointers that reference data in one or more external data sources
- obtaining the output data associated with the one or more processes of the data processing workflow that match the workflow query comprises: retrieving the output data from at least one of the one or more external data sources using at least one of the one or more pointers.
- Some embodiments are directed to a method for workflow management, the method comprising, using at least one processor: obtaining a specification of a data processing workflow comprising a plurality of processes, wherein each process is associated with input data and output data, and each process is further linked to one or more other processes of the workflow; executing one or more processes of the plurality of processes of the workflow to generate, for each of the one or more processes, input data, output data, execution metadata, or some combination thereof; and generating a pipeline execution record, wherein the pipeline execution record comprises, for each of the one or more executed processes, a process data record comprising the associated input data, output data, execution metadata, or some combination thereof.
- the process data record includes: one or more pointers that reference data in one or more external data sources; or optionally, a plurality of datasets each associated with a respective execution of the one or more processes of the plurality of processes of the workflow.
- obtaining the specification of the data processing workflow comprises: receiving, via a graphical user interface, user selection defining the one or more processes of the plurality of processes.
- the user selection includes selection of one or more processes from a library of user selectable processes.
- the method further comprises: receiving a workflow query; using the workflow query to search the pipeline execution record for one or more processes of the data processing workflow that match the workflow query; and obtaining output data associated with the one or more processes of the data processing workflow that match the workflow query.
- the method further comprises: receiving an input data query; determining if the input data query matches input data associated with the one or more processes of the data processing workflow that match the workflow query; in response to determining a match of the input data query, obtaining the output data by retrieving output data associated with the one or more processes of the data processing workflow in the pipeline execution record; and in response to determining a non-match of the input data query: (1) executing the one or more processes of the data processing workflow that match the workflow query to generate new output data; and (2) obtaining the new output data as the output data associated with the one or more processes of the data processing workflow that match the workflow query.
- Some embodiments are directed to a non-transitory computer-readable medium comprising instructions that, when executed, cause at least one processor to perform operations comprising: obtaining a specification of a data processing workflow comprising a plurality of processes, wherein each process is associated with input data and output data, and each process is further linked to one or more other processes of the workflow; executing one or more processes of the plurality of processes of the workflow to generate, for each of the one or more processes, input data, output data, execution metadata, or some combination thereof; and generating a pipeline execution record, wherein the pipeline execution record comprises, for each of the one or more executed processes, a process data record comprising the associated input data, output data, execution metadata, or some combination thereof.
- the operations further comprise: receiving a workflow query; using the workflow query to search the pipeline execution record for one or more processes of the data processing workflow that match the workflow query; and obtaining output data associated with the one or more processes of the data processing workflow that match the workflow query.
- a reference to “A and/or B”, when used in conjunction with open-ended language such as “comprising” can refer, in one embodiment, to A only (optionally including elements other than B); in another embodiment, to B only (optionally including elements other than A); in yet another embodiment, to both A and B (optionally including other elements); etc.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Business, Economics & Management (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Human Resources & Organizations (AREA)
- Software Systems (AREA)
- Strategic Management (AREA)
- Entrepreneurship & Innovation (AREA)
- Economics (AREA)
- Quality & Reliability (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Computational Linguistics (AREA)
- Development Economics (AREA)
- Educational Administration (AREA)
- Probability & Statistics with Applications (AREA)
- Game Theory and Decision Science (AREA)
- Mathematical Physics (AREA)
- Marketing (AREA)
- Operations Research (AREA)
- Fuzzy Systems (AREA)
- Tourism & Hospitality (AREA)
- General Business, Economics & Management (AREA)
- Computer Hardware Design (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
- Stored Programmes (AREA)
Priority Applications (4)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202280088786.XA CN118541680A (en) | 2021-11-23 | 2022-11-22 | Techniques for combined data and execution driven pipelining |
| US18/709,776 US20250045290A1 (en) | 2021-11-23 | 2022-11-22 | Techniques for combined data and execution driven pipeline |
| EP22899329.1A EP4420010A4 (en) | 2021-11-23 | 2022-11-22 | Methods for combined data and execution-driven pipeline |
| JP2024527836A JP2024540443A (en) | 2021-11-23 | 2022-11-22 | Techniques for Combining Data and Execution-Driven Pipelines |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US202163282584P | 2021-11-23 | 2021-11-23 | |
| US63/282,584 | 2021-11-23 | | |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2023096887A1 (en) | 2023-06-01 |
Family
ID=86540261
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/US2022/050675 (Ceased), WO2023096887A1 (en) | 2021-11-23 | 2022-11-22 | Techniques for combined data and execution driven pipeline |
Country Status (5)
| Country | Link |
|---|---|
| US (1) | US20250045290A1 (en) |
| EP (1) | EP4420010A4 (en) |
| JP (1) | JP2024540443A (en) |
| CN (1) | CN118541680A (en) |
| WO (1) | WO2023096887A1 (en) |
Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20180067732A1 (en) * | 2016-08-22 | 2018-03-08 | Oracle International Corporation | System and method for inferencing of data transformations through pattern decomposition |
| US20190258524A1 (en) * | 2012-02-14 | 2019-08-22 | Amazon Technologies, Inc. | Providing configurable workflow capabilities |
| US20200104401A1 (en) * | 2018-09-28 | 2020-04-02 | Splunk Inc. | Real-Time Measurement And System Monitoring Based On Generated Dependency Graph Models Of System Components |
2022
- 2022-11-22 JP JP2024527836A patent/JP2024540443A/en active Pending
- 2022-11-22 EP EP22899329.1A patent/EP4420010A4/en active Pending
- 2022-11-22 US US18/709,776 patent/US20250045290A1/en active Pending
- 2022-11-22 CN CN202280088786.XA patent/CN118541680A/en active Pending
- 2022-11-22 WO PCT/US2022/050675 patent/WO2023096887A1/en not_active Ceased
Non-Patent Citations (1)
| Title |
|---|
| See also references of EP4420010A4 * |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20230171274A1 (en) * | 2021-11-30 | 2023-06-01 | International Business Machines Corporation | System and method to perform governance on suspicious activity detection pipeline in risk networks |
| US12407700B2 (en) * | 2021-11-30 | 2025-09-02 | International Business Machines Corporation | System and method to perform governance on suspicious activity detection pipeline in risk networks |
Also Published As
| Publication number | Publication date |
|---|---|
| JP2024540443A (en) | 2024-10-31 |
| EP4420010A4 (en) | 2025-08-27 |
| CN118541680A (en) | 2024-08-23 |
| US20250045290A1 (en) | 2025-02-06 |
| EP4420010A1 (en) | 2024-08-28 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 22899329; Country of ref document: EP; Kind code of ref document: A1 |
| | ENP | Entry into the national phase | Ref document number: 2024527836; Country of ref document: JP; Kind code of ref document: A |
| | WWE | WIPO information: entry into national phase | Ref document number: 2022899329; Country of ref document: EP |
| | ENP | Entry into the national phase | Ref document number: 2022899329; Country of ref document: EP; Effective date: 20240520 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | WWE | WIPO information: entry into national phase | Ref document number: 202280088786.X; Country of ref document: CN |