WO2024210902A1 - Large scale data discovery with near real time data cataloguing and detailed lineage - Google Patents
Large scale data discovery with near real time data cataloguing and detailed lineage Download PDFInfo
- Publication number
- WO2024210902A1 WO2024210902A1 PCT/US2023/017830 US2023017830W WO2024210902A1 WO 2024210902 A1 WO2024210902 A1 WO 2024210902A1 US 2023017830 W US2023017830 W US 2023017830W WO 2024210902 A1 WO2024210902 A1 WO 2024210902A1
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- data
- storage systems
- lineage
- metadata
- catalog
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Links
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/25—Integrating or interfacing systems involving database management systems
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/285—Clustering or classification
- G06F16/287—Visualization; Browsing
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/25—Integrating or interfacing systems involving database management systems
- G06F16/258—Data format conversion from or to a database
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/06—Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
- G06Q10/063—Operations research, analysis or management
- G06Q10/0633—Workflow analysis
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/10—Office automation; Time management
- G06Q10/103—Workflow collaboration or project management
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/12—Protocols specially adapted for proprietary or special-purpose networking environments, e.g. medical networks, sensor networks, networks in vehicles or remote metering networks
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/50—Network services
- H04L67/51—Discovery or management thereof, e.g. service location protocol [SLP] or web services
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/50—Network services
- H04L67/55—Push-based network services
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/10—Network architectures or network communication protocols for network security for controlling access to devices or network resources
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/10—Network architectures or network communication protocols for network security for controlling access to devices or network resources
- H04L63/101—Access control lists [ACL]
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/04—Protocols specially adapted for terminals or networks with limited capabilities; specially adapted for terminal portability
Definitions
- This description relates to large scale data discovery with near real time data cataloguing and detailed lineage, and method of using the same.
- Data lineage tools show the evolution of data over time via metadata.
- Data lineage identifies the flow of the data.
- the data lineage is able to identify the creation of the data, changes made to the data, and the end target containing the data.
- the data lineage thus identifies where the data came from, what processes were involved in the creation of the data, who was involved in the creation and transformation, and where the data is currently located.
- a Data Catalog uses the same information to create a searchable inventory of data assets in an organization. Using the data catalog and lineage enables a user to discover the data that is present in the organization.
- a data object has attributes, such as who is the owner, what is the structure of the data object, who generated the data object, who is consuming the data object in the downstream, and what downstream process is being used to generate the data object.
- Apache Atlas provides native support to Hadoop based systems.
- Apache RangerTM is a framework to enable, monitor and manage comprehensive data security across the Hadoop platform.
- Atlan Another product is Atlan, which joins together metadata from various sources to create a unified data discovery, cataloging, lineage, and governance experience across data assets.
- Additional components, such as Spark helps build the data lineage using the connector with Apache Atlas.
- a method for providing a data platform includes receiving data from one or more sources, writing the received data to a data storage systems, processing the received data at the data storage systems using sensors at the data storage systems, sending metadata and data definitions associated with the data to a data service in near real-time, generating, at the data service, a data catalog based on the metadata and data definitions associated with the data, creating, at the data service, data lineage for the data, and presenting the data catalog and data lineage at a data discovery interface.
- a data platform includes a memory storing computer- readable instructions, and a processor connected to the memory, wherein the processor is configured to execute the computer-readable instructions to perform operations to receive data from one or more sources, write the received data to a data storage systems, process the received data at the data storage systems using sensors at the data storage systems, send metadata and data definitions associated with the data to a data service in near real-time, generate, at the data service, a data catalog based on the metadata and data definitions associated with the data, create, at the data service, data lineage for the data, and present the data catalog and data lineage at a data discovery interface.
- a non-transitory computer-readable media having computer-readable instructions stored thereon, which when executed by a processor causes the processor to perform operations including receiving data from one or more sources, writing the received data to a data storage systems, processing the received data at the data storage systems using sensors at the data storage systems, sending metadata and data definitions associated with the data to a data service in near real-time, generating, at the data service, a data catalog based on the metadata and data definitions associated with the data, creating, at the data service, data lineage for the data, and presenting the data catalog and data lineage at a data discovery interface.
- FIG. 1 illustrates a Data Platform according to at least one embodiment.
- Fig. 2 illustrates processes of the data platform according to at least one embodiment.
- Fig. 3 A illustrates a Data Computer Interface without Synchronous Mode.
- Fig. 3B illustrates a Data Compute Interface with Synchronous Mode according to at least one embodiment.
- Fig. 4 illustrates high availability of Apache Atlas application pods according to at least one embodiment.
- FIG. 5 illustrates simple Data Linage as provided by existing components.
- Fig. 6 illustrates capture of discrete processes in Detailed Data Lineage according to at least one embodiment.
- Fig. 7 illustrates Data Pattern Detection and Optimization according to at least one embodiment.
- Fig. 8 is a flowchart of a method for providing large scale data discovery with near realtime data cataloging and detailed lineage according to at least one embodiment.
- Fig. 9 is a high-level functional block diagram of a processor-based system according to at least one embodiment.
- Embodiments described herein describes examples for implementing different features of the provided subject matter. Examples of components, values, operations, materials, arrangements, or the like, are described below to simplify the present disclosure. These are, of course, examples and are not intended to be limiting. Other components, values, operations, materials, arrangements, or the like, are contemplated.
- the formation of a first feature over or on a second feature in the description that follows include embodiments in which the first and second features are formed in direct contact and include embodiments in which additional features are formed between the first and second features, such that the first and second features are unable to make direct contact.
- the present disclosure repeats reference numerals and/or letters in the various examples. This repetition is for the purpose of simplicity and clarity and does not in dictate a relationship between the various embodiments and/or configurations discussed.
- spatially relative terms such as “beneath,” “below,” “lower,” “above,” “upper” and the like, are used herein for ease of description to describe one element or feature’s relationship to another element(s) or feature(s) as illustrated in the figures.
- the spatially relative terms are intended to encompass different orientations of the device in use or operation in addition to the orientation depicted in the figures.
- the apparatus is otherwise oriented (rotated 90 degrees or at other orientations) and the spatially relative descriptors used herein likewise are interpreted accordingly.
- the foregoing terms are utilized interchangeably in the subject specification and related drawings.
- access point refers to a wireless network component or apparatus that serves and receives data, control, voice, video, sound, gaming, data-streams or signaling-streams from UE.
- a method for providing a data platform includes receiving data from one or more sources, writing the received data to a data storage systems, processing the received data at the data storage systems using sensors at the data storage systems, sending metadata and data definitions associated with the data to a data service in near real-time, generating, at the data service, a data catalog based on the metadata and data definitions associated with the data, creating, at the data service, data lineage for the data, and presenting the data catalog and data lineage at a data discovery interface.
- a data platform receives input at an Authentication and Middleware Interface from one or more sources.
- Data platform Data Governance, Data Storage Layer, Data Compute Interface, Data Discovery Interface, and Query Interface.
- a Data Compute Interface, a Data Service Interface, and a Query Interface are provided.
- Data Compute Interface includes a Data Generation and Transformation Layer that writes or reads data toward Data Storage Layer.
- Data Governance Access Control Layer
- Data Storage Layer controls access to the Data Storage Layer by determining access rights to the Data Storage Layer. New data and changes are captured by Sensors in the Data Storage Layer and are pushed to Data Compute Interface.
- Data Compute Interface pushes Metadata, creates Data Lineage, and performs Data Audits. This information along with data definitions are pushed to Data Service Interface.
- Data Service Interface includes a Data Discovery Interface and presents Data Catalog and Data Lineage. Duplicates in the Data Lineage and the Data Catalog are determined and eliminated. A user is able to search and discover data of interest across the data by accessing Data Discovery Interface at Data Service Interface. Data Compute Interface pulls Metadata Change Events. Data Lineage is created to present detailed, granular lineage information including details for use in performing complex data transformations. Data Service Interface receives input at Data Discovery Interface for querying the metadata associated with the data at the Data Storage Systems. Tag data, and Classification and Categorizations data are associated with the data.
- a dataset is identified by Data Query Interface at the Data Storage Layer based on the input to Data Discovery Interface of Data Service Interface, and the metadata and the data definitions fetched from the Data Service Interface.
- Data governance Access Control Layer
- Data governance controls access to the Data Storage Layer by requesting access rights to the Data Storage Layer.
- a user is able to generate a query for reading the identified dataset.
- Data is then fetched directly from the Data Storage Layer through the Data Query Layer.
- Embodiments described herein thus provide certain advantages. These advantages include one or more of providing a one-stop solution for data usage lifecycle, providing discoverability of data to end user in near real time because of the minimization of manual intervention, enabling informed business decisions to be made due to the low level lineage with detailed steps involved in data transformation and other metadata details that includes schema, provides ease of use due to automatic creation of the Data Catalog and integration with a Query Layer, or provisioning of proper data meaning and accountability.
- the capability for pattern detection helps any user to detect/search any data based on a type of operation (or a set of operations) that lead to the generation of a particular data. This knowledge helps users/applications avoid duplication of datasets and the operations that are used to generate them there-by reducing cost & effort for the same.
- FIG. 1 illustrates a Data Platform 100 according to at least one embodiment.
- an Authentication and Middleware Interface 110 receives input from one or more sources 112, for example, through a Web UI 114 and/or an Onboarding Interface 116.
- a Data Compute Interface 120 includes a Data Generation and Transformation Layer that writes or reads data toward Data Storage Layer 130.
- Configure and Initiate Pipelines 117 and Initiate Data Migration Flows 118 are configured to ingest data at Data Compute Interface 120.
- Data Compute Interface 120 uses Spark/NiFi Jobs 122 to process and distribute data over resources, and Data Definition Capture Jobs 124 to pull that data from stream queues. Data Compute Interface 120 is thus able to Read and Write Data 126 via Data Storage Layer 130.
- Data Storage Layer 130 implements a data storage systems that stores datasets. Different data storage technologies are able to be implemented, such as MinlO 131, Yugabyte 132, MySQL 133, Kafka 134, etc.
- Data Definition Language (DDL) streams 136 presents data to the data storage 131-134 and Sensors 138 categorizes data into new data and changed data, and captures change information that is pushed to Data Computer Interface 120 and then to Data Service Interface 140.
- Data Governance (Access Control Layer) 150 controls access to the Data Storage Layer 130. Data Compute Interface 120 pulls Metadata Change Events 128 from Data Storage Layer 130.
- Data Service Interface 140 presents a Data Discovery Interface 142, a Data Catalog 144, and Data Lineage 146. Access Control List (ACL) Plugins 147 provide rules that concern access to the Data Storage Layer 130 for Reading/Writing Data 148.
- Data Query Interface 160 includes Query-as-a-Service (QaaS) 162 to enable users to process and query data.
- Virtual Warehouse Layer 164 provides an abstracted view of the data at the data storage as well as tools and APIs to extract data.
- a user that wants to Search and Discover Data of Interest 170 across the data accesses Data Discovery Interface 142 at Data Service Interface 140. Users who are Data Owners are able to Tag Data, and Create Classification and Categorizations 171 associated with the data.
- Data Query Interface 160 is able to Create Views Over Existing Data 172 and supports Queries Over Multiple Data Stores 173.
- Create First Level Reports For Analytics 174 is associated with the dataset identified at the data storage is able to be created based on a query.
- Data Compute Interface 120 pushes Metadata and Creates Data Lineage and Data Audits 180 towards Data Service Interface 140.
- the metadata and data definitions associated with the data received from the one or more data storage systems are provided to Data Service Interface 140 in near real-time.
- non-real-time refers to a response time of 1 second or more and near real-time refers to responses down to 10 milliseconds (ms).
- Data Query Interface 160 fetches Metadata and Data Definitions 184 from Data Service Interface 140. Based on approved queries, Data Query Interface 160 can receive Read Query Data 166 for reading
- FIG. 2 illustrates processes of the data platform 200 according to at least one embodiment.
- an Authentication and Middleware Interface 210 receives input from one or more sources 212, for example , through a Web UI 214 and/or an Onboarding Interface 216.
- a Data Compute Interface 220 includes a Data Generation and Tranformation Layer that writes or reads data toward Data Storage Layer 230.
- Data governance (Access Control Layer) 250 controls access to the Data Storage Layer 230 by determining access rights to the Data Storage Layer 230.
- Configure and Initiate Pipelines 217 and Initiate Data Migration Flows 218 are configured to ingest data into data sources in Data Storage Layer 230 where Sensors 238 capture these changes.
- Data is provided to Data Compute Interface 220.
- Data Compute Interface 220 includes a Data Generation and Transformation Layer that writes or reads data toward Data Storage Layer 230.
- Data Compute Interface includes Spark/NiFi j obs 222 and Data Definition Capture Jobs 224 to pull that data from stream queues.
- Data Definition Capture Jobs 224 include preconfigured jobs for the capture of data definition that pull data from the stream queue.
- Data Definition Capture Jobs 234 extract information and convert it to unified format which Data Discovery Interface 224 of Data Service Interface 240 understands.
- Data Storage Layer 230 includes multiple data storage systems that stores datasets. Different data storage technologies are able to be implemented, such as MinlO 231, Yugabyte 232, MySQL 233, Kafka 234, etc.
- Data Storage Layer 230 stores datasets using Data Definition Language (DDL) streams 236.
- DDL is a subset of SQL that is a language for describing data and its relationships in a database, e.g., MinlO 231, Yugabyte 232, etc.
- ACL Plugins 247 provide rules that concern access to the Data Storage Layer 230 for Reading/Writing Data 248.
- Query Interface 360 includes Query-as-a-Service (QaaS) 262 to enable users to query data.
- Virtual Warehouse Layer 264 provides an abstracted view of the data at the data storage as well as tools and APIs to extract data.
- Sensors 238 are written for the database that categorizes data into new data and changed data, and captures change information that is pushed to Data Compute Interface 220 and then to Data Service Interface 240. Sensors 238 detect changes in data and then push the changes to the Data Compute Interface 220 and then to Data Service Interface 240. For example, a file that is written to the database, and Sensors 238 sends the file to Data Compute Interface 220. Metadata Capture Services pull these and create Data Catalog 244 and Data Lineage 246. Data Lineage 246 is created to present detailed, granular lineage information including details such data classifications and business topology that are tagged to the metadata for use in performing complex data transformations.
- Data Compute Interface 220 pushes Metadata Create Data Lineage and Data Audits 280 towards Data Service Interface 240 in near real-time. Data definitions are also able to be provided by Data Compute Interface 220 to Data Service Interface 240.
- Data Service Interface 240 includes a Data Discovery Interface 242 and presents Data Catalog 244 and Data Lineage 246. In response to the file being generated from an already existing file, Data Lineage 246 is captured that identifies the file as being derived from the earlier existing file. Metadata Information 280 is presented by Data Compute Interface 220 to Data Service Interface 240 in unified discovery format that is then provided to duplication sensor.
- a user is able to Search and Discover Data Of Interest 270 across the data by accessing Data Discovery Interface 242 at Data Service Interface 240.
- Data Lineage and Data Audit 280 are pushed from Data Compute Interface 220 to Data Service Interface 240.
- Data Service Interface 240 receives input for querying the metadata associated with the data at the data storage systems.
- Tag Data, and Created Classification and Categorizations 271 are associated with the data.
- Data Query Interface 260 is able to Create Views Over Existing Data 272 and supports Queries Over Multiple Data Stores 273.
- Data Service Interface 240 includes the data definitions that are provided to Data Query Interface 260.
- Data Catalog 244 For Data Catalog 244 or a Data Object being pushed into Data Service Interface 240, the data is optimized before the Data Catalog 244 is created. For example, one or more users try to process the same data twice. An existing pipeline that is used to create a compute job to write data is able to exist while a second pipeline associated with the same data is created. However, one Data Lineage 246 is created. Data Discovery Interface 242 determines that data already exists in the storage system 231-234 and the Data Storage Layer 230 does not create anything new again.
- Data Service Interface 240 fetches the Metadata Definitions 284 from Data Service Interface 240 and generates corresponding tables or views.
- Data Query Interface 260 implements applications that are used to search the Data Storage Layer 230 across the Data Catalog 244. Users can also query any Data Catalog 244 via Data Query Interface 260.
- a query is generated by Data Query Interface 260, a dataset is identified by Data Query Interface 260 at the Data Storage Layer 230 based on the input to Data Discovery Interface 242 of Data Service Interface 240, and the metadata and the data definitions fetched form the Data Service Interface 240.
- Data Governance (Access Control Layer) 250 controls access to the Data Storage Layer 230 by requesting access rights to the Data Storage Layer 230.
- Query Interface 260 In response to a grant of access to the identified dataset, Query Interface 260 generates a query for reading the identified dataset based on the input, the metadata, and the data definitions.
- Data Storage Layer 230 includes a proxy layer for providing high availability for data discovery of the data at the data storage systems in response to the query and for processing search requests to and responses from a plurality of applications under control of a central coordinator of the Data Storage Layer 230, wherein the central coordinator identifies an active leader from the applications and enables the proxy layer to forward requests to the active leader and to receive responses from the active leader. Data is then fetched directly from data source of Data Storage Layer 230.
- a Create First Level Reports For Analytics 274 is associated with the dataset identified at the data storage is able to be created based on a query.
- Pattern recognition and classification algorithms run in the background on the basis of the Data Catalog 244 and Data Lineage 246.
- Data Discovery 242 and Data Catalog 242 of Data Service Interface 240 provides a mechanism that helps users or an application to identify pre-existing dataset patterns by suggesting initial possible matches of the datasets in the Data Storage Layer 230 and the Data Lineages 246.
- the searching of the Data Catalog 244 includes crawling through the Data Catalog 244 and the Data Lineage 246, and presenting a suggestion of possible matches of a similar datasets or a similar set of operations to perform to generate the similar dataset.
- Figs. 3A-B illustrate addition of Synchronous Mode to the Data Compute Interface according to at least one embodiment.
- Fig. 3A illustrates Data Computer Interface without Synchronous Mode 300.
- operation of existing tools such as Apache Spark 310 is shown.
- Apache Spark 310 provides a data-processing and analytics engine that implements stage-oriented scheduling using Jobs and Stages.
- a new dataset is created.
- the data is to be captured in a consistent manner.
- the existing tools e.g., Apache Spark 310
- Spark Context 310 reflects jobs that run on top of data to generate new data points.
- Spark 310 is implemented with a connector that will push the process details to Data Discovery Interface.
- Events 320 such as from DAG Scheduler 322, SQL Execution 324, Streaming Executions 326, etc., are pushed to a Live Listener Bus 312. Listeners 312 are components notified when events happen.
- DAG Scheduler 322 transforms a logical execution plan (RDD lineage of dependencies built using RDD transformations) to a physical execution plan (using stages).
- Resilient Distributed Datasets is a fundamental data structure of Spark, and is an immutable distributed collection of objects.
- a Spark job is a computation sliced into a set of parallel tasks or stages.
- SQL Execution 324 produces SQL execution metrics.
- Streaming Execution 326 produces batches of data.
- An Asynchronous Event Queue 314 receives Events 320 from an event handler, e.g., Post All Events Process 316. Post Events 330 are provided to an Events List Per Thread 332. Multiple Listener Threads 334 are provided and the events from Events List 332 are provided to the appropriate listener thread in the Multiple Listener Threads 334.
- an event handler e.g., Post All Events Process 316.
- Post Events 330 are provided to an Events List Per Thread 332.
- Multiple Listener Threads 334 are provided and the events from Events List 332 are provided to the appropriate listener thread in the Multiple Listener Threads 334.
- Live Listeners Bus 312 allows events which Spark emits during application execution to be tracked. Events 320 are typically application start/end, job start/end, stage start/end etc. (reference link) Listeners are spawned as a child thread outside of Spark Context 310. Due to this, there is a possibility that in response to the spark main thread completing its functionality and functionality is still being carried out by Listener Threads 334, the thread is killed as the thread is an asynchronous event that the Spark main thread does not wait for completion.
- Fig. 3B illustrates a Data Computer Interface with Synchronous Mode 350 according to at least one embodiment.
- Spark Context 360 reflects jobs that run on top of data to generate new data points. In order to capture the transformation steps being applied on data, Spark 360 is implemented with a connector that will push the process details to Data Discovery Interface. Events 370, such as from DAG Scheduler 372, SQL Execution 374, Streaming Executions 376, etc., are pushed to a Live Listener Bus 362.
- An Asynchronous Event Queue 364 receives Events 320 from an event handler, e.g., Post All Events Process 366.
- a Synchronous Mode is added, wherein the Spark main thread waits for the completion of listener threads. Events, such as from DAG Scheduler, SQL Execution, Streaming Executions, etc., are pushed to a Live Listener Bus.
- a Synchronous Event Queue 367 is also able to receive Events 320 from an event handler, e.g., Post All Events Process 366. Synchronous Event Queue 367 ensures that Data Lineage is captured before the job is completed.
- More than one instance of Apache Atlas Connector are active. To be able to determine which connector is active contacting Atlas Connector, an additional layer is added on top to check the status and provide a response. Synchronous events are provided to Synchronous Event List 368. From Sync Event List 368, Synchronous events are provided to Multiple Sync Listeners 369. The active connector that is active is requested and connections are checked until an active one is identified.
- Post Events 380 are then provided to an Events List Per Thread 382.
- Multiple Listener Threads 384 are provided and the events from Events List 382 are provided to the appropriate listener thread in the Multiple Listener Threads 384.
- Fig. 4 illustrates high availability of Apache Atlas application pods 400 according to at least one embodiment.
- Apache Atlas 410 is used to provide aspects of Data Discovery, Data Catalogs, and Data Lineage. Apache Atlas 410 provides native support to Hadoop based systems. Apache Atlas 410 is able to be deployed on Google Kubemetes (GK) Cluster, Amazon Kubemetes Cluster, a plain Kubemetes Cluster, or any other Kubernetes-Based Clusters. In Fig. 4, instances of Apache Atlas 410 are provided as illustrated by Pods 1 412, Pod 2 414, and Pod 3 416. Apache Atlas 410 is able to has more or less instances than Pods 1 412, Pod 2 414, and Pod 3 416.
- Apache Atlas is coupled to Yugabyte 420 and to Search Engine 430, e.g., Apache Solr.
- Apache Atlas 410 uses and interacts with a variety of systems to provide metadata management and Data Lineage.
- Apache Solr 430 is used to index the Atlas Data so that the data is able to be searched.
- Apache Atlas is not reachable in response to failover from an active instance to standby instance on Kubemetes.
- a Proxy Layer 440 is used to interface between Apache Atlas 410, and Users 450 and External Jobs 452.
- a central coordinator such as Zookeeper 460, is used to manage access to an active instance.
- Zookeeper 460 is a centralized service for providing configuration information, naming, synchronization and group services over large clusters in distributed systems.
- Zookeeper 460 provides an infrastructure for cross-node synchronization by maintaining status type information in memory on Zookeeper 460.
- Zookeeper 460 keeps a copy of the state of the system and persists this information in local log files.
- Large Hadoop clusters are able to be supported using multiple Zookeepers 460, with a master server synchronizing the top-level servers.
- Pod 1 412 and Pod 3 414 are inactive, and Pod 3 416 is Active. In response to an instance not being available, that instance is switched from Active to Inactive.
- Proxy Layer 440 is able to Check The Status 442 of Pods 1 412, Pod 2 414, and Pod 3 416. Thus, User 450 is able to access an Active instance without knowing which of Pods 1 412, Pod 2 414, and Pod 3 416 are Active or Inactive. Proxy Layer 440 determines which of Pods 1 412, Pod 2 414, and Pod 3 416 is Active.
- Proxy Layer 440 automatically determines that Pod 3 416 is Active and Pod 3 is thus selected to service user requests.
- the other instances e.g., Pods 1 412 and Pod 414, will automatically be deemed Inactive.
- one of the other instances In response to the Active instance becoming unavailable, either because the instance is deliberately stopped, or due to failures, one of the other instances automatically is elected as an Active instance and starts to service user requests.
- An Active instance is the instance that is able to respond to user requests correctly. The Active instance is able to create, delete, modify or respond to queries on metadata objects.
- Zookeeper 460 takes care of election of the Current Leader Selection 462, i.e., Active instance (e.g., Pod 416), where Pod 412 and Pod 414 are down or Inactive. Proxy Layer 440 keeps track of the current Active instances and forwards Requests 456, 458 to the Active instance. Thus, Zookeeper is able to provide a Response 418 to Proxy Layer 440 from the Active Leader, e.g., Pod 3 416. Proxy Layer 440 then forwards Response 444 to User 450 or Response 446 to External Jobs 452.
- Active instance e.g., Pod 416
- Proxy Layer 440 then forwards Response 444 to User 450 or Response 446 to External Jobs 452.
- Fig. 5 illustrates simple Data Linage 500 as provided by existing components.
- a First Dataset, User Data 510, and a Second Dataset, Order Data 520 have been joined at Data Generation Process 530 to create a Third Dataset, Order X User Data 540.
- the user is able to use the Third Dataset, Order X User Data 540.
- Data Lineage identifies that the Third Dataset, Order X User Data 540, originated from the First Dataset, User Data 510, and the Second Dataset, Order Data 520, and that Data Generation Process 530 was used to generate the Third Dataset, Order X User Data 540.
- Some existing products provide this type of Data Lineage.
- Some existing products also provide a SQL process that generates the Third Dataset.
- Fig. 6 illustrates capture of discrete processes in Detailed Data Lineage 600 according to at least one embodiment.
- Detailed Data Lineage 600 for joining User Data 610 and Order Data 620 is broken down to provide details of the process prior to the Join Process 630. Overall the process is similar to where User-Data 610 and Order Data 620 are transformed to produce Order X User Data 640. However, the discrete internal processes that occur before generating the Third Dataset, i.e., Order X User Data 630, are captured. [0071] In Fig. 6, the Detailed Data Lineage 600 is broken down at a low level to represent intermediate data and intermediate processes as well to make the lineage more informative. The capture of the discrete processes helps end users to decide upon data usage accordingly.
- the User Data 610 is provided to User Data Modified Generator 650 to produce User Data Modified 642.
- User Data 610 is modified according to Modification Operation 654 of “select user id, state, city, pincode, emai, phone number, split(name, ”)(0) as f_name.split)name, ”)(1) as I name, toUpper(f name), toUpper(l name), todate(creation timestamp) as creation date, dated ff(now, creation date) as system age from user data” to produce User Data Modified 652.
- Modification Operation 654 of “select user id, state, city, pincode, emai, phone number, split(name, ”)(0) as f_name.split)name, ”)(1) as I name, toUpper(f name), toUpper(l name), todate(creation timestamp) as creation date, dated ff(now, creation date) as system age from user data” to produce User Data Modified 652.
- User Data Modified 652 is filtered at User Data Filtered Generator 660 to remove some irrelevant data to produce User Data Filtered 662.
- Filter Operation 664 of “select user id, state, city, pincode, email, phone number, split(name, ”)(0) as f_name.split(name, ”)(1) as I name, toUpper(f name), toUpper(l name), todate(creation timestamp) as
- Order_ Data 620 is joined with the User Data Filtered 662 to produce Order X User Data 640.
- Modification Operation 654 and Filter Operation 664 those skilled in the art recognize that embodiments described herein are not meant to be limited to the Join Operation 632 described here, but other or additional join operations are capable of being made.
- the capture of the details of the processes in the Detailed Data Lineage 600 enables a user to determine what is happening internally, and is not limited to just some processes, name, or basic sequence.
- the Detailed Data Lineage 600 enables a user to make confident decisions based on the details of the processes or sequences.
- Fig. 7 illustrates Data Pattern Detection and Optimization 700 according to at least one embodiment.
- Data Service 710 includes a Data Discovery Interface 720 that enables users/applications to Search 722 for data across the Data Catalog and Data Lineage 730.
- Data Discovery Interface 720 provide search query 724 to Data Catalog and Data Lineage 730.
- Background jobs 740 feed data to a Process 750 that models and also automatically tags and classifies data.
- Host Artificial Intelligence/Machine Learning (AI/ML) 760 run in the background on the basis of the Data Catalogue and Data Lineage 730.
- Users 770 via, for example Web User Interface (UI) 772, or Applications 780 are able to use AI/ML 760 to identify pre-existing dataset patterns using suggestions of possible matches of the datasets and the Data Catalog and Data Lineage 730.
- the searching of the Data Catalog and Data Lineage 730 includes the AI/ML crawling through the Data Catalog and the Data Lineage 730 and presenting a suggestion of possible matches of a similar dataset or a similar set of operations to perform to generate the similar dataset.
- Data Compute Interface 790 a user is able to first select the source dataset, then define set of operations/transformations and run jobs to execute the operations to generate the datasets 792. While defining those jobs, internally, Data Compute Interface 790 uses the pattern detection capability provided by Data Discovery Interface 720, with inputs as the source dataset and set of transformations being applied, to present the user with a pre-existing patterns in the system. This prevents user in creating a duplicate set of jobs and datasets.
- the results presented include the list of pre-existing operations that are applied on the item.
- any external Application 780 is able to use the pattern detection capability provided by AI/ML 760 to avoid duplicity of datasets and compute jobs. Also, there are background jobs that run continuously feeding Data Catalog and Data Lineage 730 to the hosted Models 750. This will help in creating system generated tags and classifications that are able to be directly used by the user for enhancing the search, or used internally by Data Discovery Interface 730 to detect similar datasets and transformation patterns.
- Fig. 8 is a flowchart 800 of a method for providing large scale data discovery with near real-time data cataloging and detailed lineage according to at least one embodiment.
- a Data Compute Interface 220 includes a Data Generation and Transformation Layer that writes or reads data toward Data Storage Layer 230.
- Data governance (Access Control Layer) 250 controls access to the Data Storage Layer 230 by determining access rights to the Data Storage Layer 230.
- the received data is processed at the data storage systems using sensors at the data storage systems S818.
- Sensors 238 are written for the database that categorizes data into new data and changed data, and captures change information that is pushed to Data Computer Interface 220 and then to Data Service Interface 240.
- Sensors 238 detect changes in data and then push the changes to the Data Computer Interface 220 and then to Data Service Interface 240.
- a file that is written to the database, and Sensors 238 sends the file to Data Compute Interface 220. Metadata Capture Services pull these and create Data Catalog 244 and Data Lineage 246.
- Metadata and data definitions associated with the data are sent to a data service in near real-time S822.
- Data Compute Interface 220 pushes Metadata Create Data Lineage and Data Audits 280 towards Data Service Interface 240.
- Data definitions are also able to be provided by Data Compute Interface 220 to Data Service Interface 240.
- a data catalog is generated at the data service based on the metadata and data definitions associated with the data S826.
- Data Service Interface 240 includes a Data Discovery Interface 242 and presents Data Catalog 244 and Data Lineage 246.
- Data Lineage 246 is captured that identifies the file as being derived from the earlier existing file.
- Metadata Information 280 is presented by Data Compute Interface 220 to Data Service Interface 240 in unified discovery format that is then provided to duplication sensor.
- Data lineage for the data is crated at the data service S830.
- the Data Catalog 244 presents catalogues related information, such as lineage, data path, owner, data description, etc. in Data Discovery 242.
- the data catalog and data lineage are presented at a data discovery interface of the data service S834.
- a user is able to Search and Discover Data Of Interest 270 across the data by accessing Data Discovery Interface 242 at Data Service Interface 240.
- Data Compute Interface 220 pulls Metadata Change Events 228.
- Data Lineage 246 is created to present detailed, granular lineage information including details such data classifications and business topology that are tagged to the metadata for use in performing complex data transformations.
- Input is received via the data discovery interface for querying the metadata associated with the data at the data storage systems S838.
- Data Lineage and Data Audit 280 are pushed from Data Compute Interface 220 to Data Service Interface 240.
- Data Service Interface 240 receives input for querying the metadata associated with the data at the data storage systems.
- Tag Data, and Created Classification and Categorizations 271 are associated with the data.
- a dataset at the data storage systems is identified by the query interface based on the input, the metadata, and the data definitions S842.
- a query is generated by Data Query Interface 260
- a dataset is identified by Data Query Interface 260 at the Data Storage Layer 230 based on the input to Data Discovery Interface 242 of Data Service Interface 240, and the metadata and the data definitions fetched form the Data Service Interface 240.
- Data Governance (Access Control Layer) 250 controls access to the Data Storage Layer 230 by requesting access rights to the Data Storage Layer 230.
- the query interface In response to a grant of access to the identified dataset, the query interface generates a query for reading the identified dataset based on the input, the metadata, and the data definitions S850.
- the query interface in response to a grant of access to the identified dataset, Query Interface 260 generates a query for reading the identified dataset based on the input, the metadata, and the data definitions.
- Data Storage Layer 230 includes a proxy layer for providing high availability for data discovery of the data at the data storage systems in response to the query and for processing search requests to and responses from a plurality of applications under control of a central coordinator of the Data Storage Layer 230, wherein the central coordinator identifies an active leader from the applications and enables the proxy layer to forward requests to the active leader and to receive responses from the active leader. Data is then fetched directly from data source of Data Storage Layer 230.
- At least one embodiment of the method for providing large scale data discovery with near real-time data cataloging and detailed lineage includes receiving data from one or more sources, writing the received data to a data storage systems, processing the received data at the data storage systems using sensors at the data storage systems, sending metadata and data definitions associated with the data to a data service in near real-time, generating, at the data service, a data catalog based on the metadata and data definitions associated with the data, creating, at the data service, data lineage for the data, and presenting the data catalog and data lineage at a data discovery interface.
- FIG. 9 is a high-level functional block diagram of a processor-based system 900 according to at least one embodiment.
- processing circuitry 900 provides large scale data discovery with near real-time data cataloging and detailed lineage.
- Processing circuitry 900 implements large scale data discovery with near real-time data cataloging and detailed lineage using Processor 902.
- Processing circuitry 900 also includes a Non-Transitory, Computer-Readable Storage Medium 904 that is used to implement a data platform for providing large scale data discovery with near real-time data cataloging and detailed lineage.
- Non-Transitory, Computer- Readable Storage Medium 904 is encoded with, i.e., stores, Instructions 906, i.e., computer program code, that are executed by Processor 902 causes Processor 902 to perform operations for providing large scale data discovery with near real-time data cataloging and detailed lineage. Execution of Instructions 906 by Processor 902 represents (at least in part) an application which implements at least a portion of the methods described herein in accordance with one or more embodiments (hereinafter, the noted processes and/or methods). [0098] Processor 902 is electrically coupled to Non-Transitory, Computer-Readable Storage Medium 904 via a Bus 908.
- Processor 902 is electrically coupled to an Input/Output (VO) Interface 910 by Bus 908.
- a Network Interface 912 is also electrically connected to Processor 902 via Bus 908.
- Network Interface 912 is connected to a Network 914, so that Processor 902 and Non-Transitory, Computer-Readable Storage Medium 904 connect to external elements via Network 914.
- Processor 902 is configured to execute Instructions 906 encoded in Non- Transitory, Computer-Readable Storage Medium 904 to cause processing circuitry 900 to be usable for performing at least a portion of the processes and/or methods.
- Processor 902 is a Central Processing Unit (CPU), a multi-processor, a distributed processing system, an Application Specific Integrated Circuit (ASIC), and/or a suitable processing unit.
- CPU Central Processing Unit
- ASIC Application Specific Integrated Circuit
- Processing circuitry 900 includes VO Interface 910.
- VO interface 910 is coupled to external circuitry.
- VO Interface 910 includes a keyboard, keypad, mouse, trackball, trackpad, touchscreen, and/or cursor direction keys for communicating information and commands to Processor 902.
- Processing circuitry 900 also includes Network Interface 912 coupled to Processor 902.
- Network Interface 912 allows processing circuitry 900 to communicate with Network 914, to which one or more other computer systems are connected.
- Network Interface 912 includes wireless network interfaces such as Bluetooth, Wi-Fi, Worldwide Interoperability for Microwave Access (WiMAX), General Packet Radio Service (GPRS), or Wideband Code Division Multiple Access (WCDMA); or wired network interfaces such as Ethernet, Universal Serial Bus (USB), or Institute of Electrical and Electronics Engineers (IEEE) 864.
- Processing circuitry 900 is configured to receive information through VO Interface 910.
- the information received through VO Interface 910 includes one or more of instructions, data, design rules, libraries of cells, and/or other parameters for processing by Processor 902.
- the information is transferred to Processor 902 via Bus 908.
- Processing circuitry 900 is configured to receive information related to a User Interface (UI) through VO Interface 910.
- UI User Interface
- the information is stored in Non-Transitory, Computer-Readable Storage Medium 904 as UI 922.
- one or more Non-Transitory, Computer-Readable Storage Medium 904 having stored thereon Instructions 906 (in compressed or uncompressed form) that may be used to program a computer, processor, or other electronic device) to perform processes or methods described herein.
- the one or more Non-Transitory, Computer-Readable Storage Medium 904 include one or more of an electronic storage medium, a magnetic storage medium, an optical storage medium, a quantum storage medium, or the like.
- the Non-Transitory, Computer-Readable Storage Medium 904 may include, but are not limited to, hard drives, floppy diskettes, optical disks, read-only memories (ROMs), random access memories (RAMs), erasable programmable ROMs (EPROMs), electrically erasable programmable ROMs (EEPROMs), flash memory, magnetic or optical cards, solid-state memory devices, or other types of physical media suitable for storing electronic instructions.
- the one or more Non- Transitory Computer-Readable Storage Media 904 includes a Compact Disk-Read Only Memory (CD-ROM), a Compact Disk-Read/Write (CD-R/W), and/or a Digital Video Disc (DVD).
- Non-Transitory, Computer-Readable Storage Medium 904 stores Instructions 906 configured to cause Processor 902 to perform at least a portion of the processes and/or methods for implementing a Data Platform as described herein. In one or more embodiments, Non-Transitory, Computer-Readable Storage Medium 904 also stores information, such as algorithm which facilitates performing at least a portion of the processes and/or methods for implementing a Data Platform as described herein.
- Processor 902 executes Instructions 906 at Non-Transitory Computer- Readable Medium to implement an Authentication and Middleware Interface 926 for receiving input from one or more sources, for example , through User Interfaces 920, such as a Web UI and/or an Onboarding Interface. Processor further executes Instructions 906 at Non-Transitory Computer-Readable Medium to implement Data Governance 921, Data Storage Layer 922, Data Compute Interface 923, Data Discovery Interface 924, and Query Interface 925. Processor 902 executes Instructions 906 at Non-Transitory Computer-Readable Medium to implement a User Interface 932 on a Display 930 for providing large scale data discovery with near real-time data cataloging and detailed lineage.
- User Interface 932 is able to display data as managed in Data Platform 900 by Processor 902.
- User Interface 932 includes a Data Compute Interface 940, a Data Service Interface 950, and a Query Interface 960.
- Data Compute Interface 940 includes a Data Generation and Transformation Layer that writes or reads data toward Data Storage Layer.
- Data Governance Access Control Layer
- Data Governance controls access to the Data Storage Layer 922 by determining access rights to the Data Storage Layer 922.
- New data and changes are captured by Data Storage Layer 922 and are pushed to Data Compute Interface 940 and then to Data Service Interface 950. For example, a file that is written to the database, and Sensors 972 at Data Storage Systems 970 send the file to Data Compute Interface 940.
- Metadata Capture Services pull these and create Data Catalog 954 and Data Lineage 956 at Data Service Interface 950.
- Data Compute Interface 940 pushes Metadata, creates Data Lineage 956, and performs Data Audits. This information is pushed to Data Service Interface 950 in near real-time. Data definitions are also able to be provided by Data Compute Interface 940 to Data Service Interface 950.
- Data Service Interface 950 includes a Data Discovery Interface 952 and presents Data Catalog 954 and Data Lineage 956. In response to the file being generated from an already existing file, Data Lineage 956 is captured that identifies the file as being derived from the earlier existing file. Metadata Information is presented by Data Compute Interface 940 to Data Service Interface 950 in unified discovery format that is then provided to duplication sensor.
- Data Catalog 954 presents catalogues related information, such as lineage, data path, owner, data description, etc. in Data Discovery Interface 952.
- a user is able to search and discover data of interest across the data by accessing Data Discovery Interface 952 at Data Service Interface 950.
- Data Compute Interface 940 pulls Metadata Change Events.
- Data Lineage 956 is created to present detailed, granular lineage information including details such data classifications and business topology that are tagged to the metadata for use in performing complex data transformations.
- Data Lineage 956 and Data Audit are pushed from Data Compute Interface 940 to Data Service Interface 950.
- Data Service Interface 950 receives input at Data Discovery Interface 952 for querying the metadata associated with the data at the Data Storage Systems 970. Tag data, and Classification and Categorizations data are associated with the data.
- a query is generated by Data Query Interface 960
- a dataset is identified by Data Query Interface 960 at the Data Storage Layer 922 based on the input to Data Discovery Interface 952 of Data Service Interface 950, and the metadata and the data definitions fetched from the Data Service Interface 950.
- Data Governance Access Control Layer
- Data Storage Layer 922 includes a proxy layer for providing high availability for data discovery of the data at the Data Storage Systems 970 in response to the query and for processing search requests to and responses from a plurality of applications under control of a central coordinator of the Data Storage Layer 922, wherein the central coordinator identifies an active leader from the applications and enables the proxy layer to forward requests to the active leader and to receive responses from the active leader. Data is then fetched directly from Data Storage Systems 970 by Data Storage Layer 922.
- Embodiments described herein thus provide certain advantages. These advantages include one or more of providing a one-stop solution for data usage lifecycle, providing discoverability of data to end user in near real time because of the minimization of manual intervention, enabling informed business decisions to be made due to the low level lineage with detailed steps involved in data transformation and other metadata details that includes schema, provides ease of use due to automatic creation of the Data Catalog and integration with a Query Layer, or provisioning of proper data meaning and accountability.
- the capability for pattern detection helps any user to detect/search any data on the basis of a type of operation(or a set of operations) that lead to the generation of a particular data. This knowledge helps users/applications avoid duplication of datasets and the operations that are used to generate them there-by reducing cost & effort for the same.
- An aspect of this description is directed to a method [1] for providing a data platform includes receiving data from one or more sources, writing the received data to one or more data storage systems, processing the received data at the one or more data storage systems using sensors at the data storage systems, sending metadata and data definitions associated with the data to a data service in near real-time, generating, at the data service, a data catalog based on the metadata and data definitions associated with the data, creating, at the data service, data lineage for the data, and presenting the data catalog and data lineage at a data discovery interface.
- the method described in [1], further includes receiving input via the data discovery interface for querying the metadata associated with the data at the data storage systems, and identifying a dataset at the data storage systems based on the input, the metadata, and the data definitions.
- the method described in [1] to [2], further includes requesting access to the identified dataset at the data storage, and in response to a grant of access to the identified dataset, generating a query for reading the identified dataset based on the input, the metadata, and the data definitions.
- the method described in [1] to [3], further includes providing high availability for data discovery of the data at the data storage systems in response to the query by providing a proxy layer for processing search requests to and responses from a plurality of applications under control of a central coordinator, wherein the central coordinator identifies an active leader from the applications and enables the proxy layer to forward requests to the active leader and to receive responses from the active leader.
- the method described in [1] to [4], further includes creating, at the data service, data audits associated with the data, and pulling, by the data compute interface, change events representing the data for the one or more sources at the data storage systems.
- An aspect of this description is directed to a data platform [8], including a memory storing computer-readable instructions, and a processor connected to the memory, wherein the processor is configured to execute the computer-readable instructions to perform operations to receive data from one or more sources, write the received data to a data storage systems, process the received data at the data storage systems using sensors at the data storage systems, send metadata and data definitions associated with the data to a data service in near real-time, generate, at the data service, a data catalog based on the metadata and data definitions associated with the data, create, at the data service, data lineage for the data, and present the data catalog and data lineage at a data discovery interface.
- An aspect of this description is directed to a non-transitory computer-readable media having computer-readable instructions stored thereon [15 ⁇ , which when executed by a processor causes the processor to perform operations including receiving data from one or more sources, writing the received data to a data storage systems, processing the received data at the data storage systems using sensors at the data storage systems, sending metadata and data definitions associated with the data to a data service in near real-time, generating, at the data service, a data catalog based on the metadata and data definitions associated with the data, creating, at the data service, data lineage for the data, and presenting the data catalog and data lineage at a data discovery interface.
- the non-transitory computer-readable media described in [15], further includes receiving input via the data discovery interface for querying the metadata associated with the data at the data storage systems, and identifying a dataset at the data storage based on the input, the metadata, and the data definitions, requesting access to the identified dataset at the data storage systems, and in response to a grant of access to the identified dataset, generating a query for reading the identified dataset based on the input, the metadata, and the data definitions.
- the non-transitory computer-readable media described in [15] to [16], further includes providing high availability for data discovery of the data at the data storage systems in response to the query by providing a proxy layer for processing search requests to and responses from a plurality of applications under control of a central coordinator, wherein the central coordinator identifies an active leader from the applications and enables the proxy layer to forward requests to the active leader and to receive responses from the active leader.
- the non-transitory computer-readable media described in [15] to [17], further includes creating, at the data service, data audits associated with the data, and pulling change events representing the data for the one or more sources at the data storage systems.
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Business, Economics & Management (AREA)
- Human Resources & Organizations (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Strategic Management (AREA)
- General Engineering & Computer Science (AREA)
- Entrepreneurship & Innovation (AREA)
- Quality & Reliability (AREA)
- Economics (AREA)
- Signal Processing (AREA)
- Computer Networks & Wireless Communication (AREA)
- General Business, Economics & Management (AREA)
- Marketing (AREA)
- Operations Research (AREA)
- Tourism & Hospitality (AREA)
- Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Development Economics (AREA)
- Educational Administration (AREA)
- Game Theory and Decision Science (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
Claims
Priority Applications (4)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US18/264,581 US20250021585A1 (en) | 2023-04-07 | 2023-04-07 | Large scale data discovery with near real time data cataloguing and detailed lineage |
| PCT/US2023/017830 WO2024210902A1 (en) | 2023-04-07 | 2023-04-07 | Large scale data discovery with near real time data cataloguing and detailed lineage |
| JP2025543121A JP2026507433A (en) | 2023-04-07 | 2023-04-07 | Near real-time data cataloging and deep lineage for massive data discovery |
| EP23932268.8A EP4689921A4 (en) | 2023-04-07 | 2023-04-07 | Large-scale data discovery with near real-time data cataloging and detailed sequencing. |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/US2023/017830 WO2024210902A1 (en) | 2023-04-07 | 2023-04-07 | Large scale data discovery with near real time data cataloguing and detailed lineage |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2024210902A1 true WO2024210902A1 (en) | 2024-10-10 |
Family
ID=92972572
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US2023/017830 Ceased WO2024210902A1 (en) | 2023-04-07 | 2023-04-07 | Large scale data discovery with near real time data cataloguing and detailed lineage |
Country Status (4)
| Country | Link |
|---|---|
| US (1) | US20250021585A1 (en) |
| EP (1) | EP4689921A4 (en) |
| JP (1) | JP2026507433A (en) |
| WO (1) | WO2024210902A1 (en) |
Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20140372588A1 (en) * | 2011-12-14 | 2014-12-18 | Level 3 Communications, Llc | Request-Response Processing in a Content Delivery Network |
| US20200026710A1 (en) * | 2018-07-19 | 2020-01-23 | Bank Of Montreal | Systems and methods for data storage and processing |
| US20230116563A1 (en) * | 2021-10-07 | 2023-04-13 | Starkey Laboratories, Inc. | Artifact detection and logging for tuning of feedback canceller |
Family Cites Families (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| EP2972770A4 (en) * | 2013-03-15 | 2016-11-16 | Ab Initio Technology Llc | METADATA MANAGEMENT SYSTEM |
| US9348879B2 (en) * | 2013-07-02 | 2016-05-24 | Bank Of America Corporation | Data lineage transformation analysis |
| CN108701257B (en) * | 2016-08-22 | 2023-01-06 | 甲骨文国际公司 | System and method for dynamic, incremental recommendation within real-time visual simulation |
| US11501191B2 (en) * | 2018-09-21 | 2022-11-15 | International Business Machines Corporation | Recommending machine learning models and source codes for input datasets |
| US11449508B2 (en) * | 2020-05-05 | 2022-09-20 | Microsoft Technology Licensing, Llc | Serverless data lake indexing subsystem and application programming interface |
-
2023
- 2023-04-07 WO PCT/US2023/017830 patent/WO2024210902A1/en not_active Ceased
- 2023-04-07 EP EP23932268.8A patent/EP4689921A4/en active Pending
- 2023-04-07 US US18/264,581 patent/US20250021585A1/en active Pending
- 2023-04-07 JP JP2025543121A patent/JP2026507433A/en active Pending
Patent Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20140372588A1 (en) * | 2011-12-14 | 2014-12-18 | Level 3 Communications, Llc | Request-Response Processing in a Content Delivery Network |
| US20200026710A1 (en) * | 2018-07-19 | 2020-01-23 | Bank Of Montreal | Systems and methods for data storage and processing |
| US20230116563A1 (en) * | 2021-10-07 | 2023-04-13 | Starkey Laboratories, Inc. | Artifact detection and logging for tuning of feedback canceller |
Non-Patent Citations (1)
| Title |
|---|
| See also references of EP4689921A4 * |
Also Published As
| Publication number | Publication date |
|---|---|
| EP4689921A1 (en) | 2026-02-11 |
| EP4689921A4 (en) | 2026-03-11 |
| US20250021585A1 (en) | 2025-01-16 |
| JP2026507433A (en) | 2026-03-04 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US12204536B2 (en) | Query scheduling based on a query-resource allocation and resource availability | |
| US11580107B2 (en) | Bucket data distribution for exporting data to worker nodes | |
| US11487771B2 (en) | Per-node custom code engine for distributed query processing | |
| US11321321B2 (en) | Record expansion and reduction based on a processing task in a data intake and query system | |
| US11593377B2 (en) | Assigning processing tasks in a data intake and query system | |
| US11586627B2 (en) | Partitioning and reducing records at ingest of a worker node | |
| WO2023220948A1 (en) | Method, apparatus and system for configurable data collection for networked data analytics and management | |
| CN107003906B (en) | Type-to-type analysis of cloud computing technology components | |
| CN109997126B (en) | Event-driven extract, transform, load (ETL) processing | |
| US20190258635A1 (en) | Determining Records Generated by a Processing Task of a Query | |
| CN110908641B (en) | Visualization-based stream computing platform, method, device and storage medium | |
| US9336288B2 (en) | Workflow controller compatibility | |
| US9438665B1 (en) | Scheduling and tracking control plane operations for distributed storage systems | |
| US11934869B1 (en) | Enhancing efficiency of data collection using a discover process | |
| CN112597224B (en) | Data export method, data export device, electronic device and medium | |
| CN108510081A (en) | machine learning method and platform | |
| CN108062384A (en) | The method and apparatus of data retrieval | |
| US12174722B2 (en) | Characterizing operation of software applications having large number of components | |
| US9779177B1 (en) | Service generation based on profiled data objects | |
| US20250021585A1 (en) | Large scale data discovery with near real time data cataloguing and detailed lineage | |
| WO2024251107A1 (en) | Container orchestration method, data access method, and electronic device and storage medium | |
| CN120256460A (en) | Software package processing method, device, computer equipment and storage medium | |
| US11853304B2 (en) | System and method for automated data and workflow lineage gathering | |
| CN117931813A (en) | Lake bin metadata change determining method, device, equipment and medium | |
| US10789282B1 (en) | Document indexing with cluster computing |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| WWE | Wipo information: entry into national phase |
Ref document number: 18264581 Country of ref document: US |
|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 23932268 Country of ref document: EP Kind code of ref document: A1 |
|
| ENP | Entry into the national phase |
Ref document number: 2025543121 Country of ref document: JP Kind code of ref document: A |
|
| WWE | Wipo information: entry into national phase |
Ref document number: 2023932268 Country of ref document: EP |
|
| NENP | Non-entry into the national phase |
Ref country code: DE |
|
| ENP | Entry into the national phase |
Ref document number: 2023932268 Country of ref document: EP Effective date: 20251107 |
|
| ENP | Entry into the national phase |
Ref document number: 2023932268 Country of ref document: EP Effective date: 20251107 |
|
| ENP | Entry into the national phase |
Ref document number: 2023932268 Country of ref document: EP Effective date: 20251107 |
|
| ENP | Entry into the national phase |
Ref document number: 2023932268 Country of ref document: EP Effective date: 20251107 |
|
| ENP | Entry into the national phase |
Ref document number: 2023932268 Country of ref document: EP Effective date: 20251107 |
|
| ENP | Entry into the national phase |
Ref document number: 2023932268 Country of ref document: EP Effective date: 20251107 |
|
| ENP | Entry into the national phase |
Ref document number: 2023932268 Country of ref document: EP Effective date: 20251107 |
|
| ENP | Entry into the national phase |
Ref document number: 2023932268 Country of ref document: EP Effective date: 20251107 |
|
| WWP | Wipo information: published in national office |
Ref document number: 2023932268 Country of ref document: EP |