EP3948576A1 - Cross-database index on a distributed database system - Google Patents
Cross-database index on a distributed database system
- Publication number
- EP3948576A1 (application EP20716739.6A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- data
- database
- index
- additional
- tokens
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/316—Indexing structures
- G06F16/319—Inverted lists
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/27—Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
Definitions
- the invention relates to a method, a source computer system and a distributed database system for implementing a cross-database index on a distributed database system.
- complex distributed systems are emerging, the technical interaction of which raises new technical challenges.
- corresponding distributed systems generally do not have a shared memory, which makes it difficult to carry out data processing functions.
- the invention is based on the object of creating an improved method for implementing an index on a distributed database system.
- Embodiments include a computer-implemented method for implementing a cross-database index on a distributed database system which comprises a plurality of independent individual databases, the individual databases being communicatively connected to one another via a network,
- the individual databases are each managed by a multi-model database management system, the individual databases each comprising a plurality of database-specific data records which are stored in a document-oriented first data model of the respective individual database, the stored data records each comprising one or more field values, the individual field values of the stored data records each being stored in a field,
- the individual databases each further comprising a searchable first index which is stored in a second data model of the respective individual database, the index of the respective individual database comprising a plurality of tokens generated from the field values of the data records stored in the document-oriented data model of the corresponding individual database, each token in the index being linked to one or more pointers to those data records stored in the document-oriented data model of the corresponding individual database from whose field values the corresponding token was generated,
- Embodiments can have the advantage that by creating and sending the supplementary data record, the index of the receive database can be adapted to the changes in the index of the source database. Even though it is a distributed database system with a number of independent individual databases, a cross-database index can be implemented in this way. If all individual databases are configured to act as the source database when an additional data record is received, a uniform cross-database index of the distributed database system can be provided. If all individual databases are configured to receive the corresponding supplementary data records as receive databases, then all individual databases are provided with an identical cross-database index of the distributed database system. This enables data processing functions to be carried out locally, taking into account all the data made available in the distributed database system.
- the index does not include the full context information that the underlying data records include. This full context information is not lost, however, but is available to the individual databases individually.
- the amounts of data that have to be exchanged between the individual databases to implement the cross-database index are reduced and, at the same time, the data in the data records are protected. This protection includes that only the individual, independent single databases have access to the data records stored in their respective first data model. Only this respective individual database therefore has access to the full information content of the corresponding data records.
- Each of the individual databases comprises, in its first data model, an independent plurality of data records, which may differ significantly or even completely from the data records stored in the first data models of the other individual databases.
- the restricted access protects against unauthorized access to the data by third parties.
- Embodiments can have the advantage that the data processing functions are each carried out individually for each individual database. For example, each of the individual databases independently indexes data records that it receives. As a result, the data records can be processed in parallel. Thus, the processing of the data can be parallelized and thus accelerated.
- Embodiments are thus based on a distributed system or database system which comprises a plurality of nodes or computers, each of which makes an individual database available and can communicate with one another via communication interfaces.
- The software, i.e. the data processing functions, can provide a plurality of functionalities, such as algorithms, formulas or models.
- the data processing function can be limited to indexing received data records and creating or managing an index based thereon. According to further embodiments, however, the data processing function can also include further functionalities, the application of which is reflected in the resulting structure of the index. The data provided by the index can thus be optimized for a variety of purposes.
- the software can, for example, be software which is installed on the individual nodes of the system, i.e. the individual databases, independently of the other nodes of the distributed system.
- In this case, "distributed" refers to the fact that each of the nodes of the distributed system comprises a complete copy of the index.
- The resulting index indexes all individual databases of the distributed system, i.e. the index can be a combined overall index which includes all sub-indexes of the individual databases of the distributed system.
- A distributed application describes a complex application program which runs on a distributed system, i.e. on several computers, which exchange information with one another.
- The corresponding application is distributed through horizontal cuts in the software layer model, so that the task of the overall software can be divided among individual software components in the form of the individual cuts.
- All components of the application that communicate with one another for this purpose must cooperate in order to fulfill the overall task.
- The distributed application is mostly transparent to a client, i.e. the distributed application appears as a uniform application.
- each individual database carries out a data processing function.
- This can involve the same data processing function in all individual databases, which is consequently parallelized, or it can be individual database-individual data processing functions, the execution of which is complementary.
- the creation of an index of different individual databases can differ in terms of the granularity of the tokens added to the index.
- In the case of sensor measured values, for example, the exact measured values, rounded measured values or assignments to a value interval can be added to the index. If the values come from different sensors, different requirements can apply to them, which can be implemented by means of individual database-specific data processing functions.
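The three token granularities mentioned for sensor values can be illustrated with a minimal Python sketch. The function name `sensor_token`, the granularity labels and the default interval width `step` are assumptions made for this illustration and do not come from the patent text.

```python
import math


def sensor_token(value: float, granularity: str, step: float = 5.0) -> str:
    """Turn one sensor reading into an index token (hypothetical helper).

    granularity: "exact"    -> the measured value itself
                 "rounded"  -> the value rounded to an integer
                 "interval" -> the half-open interval of width `step` containing it
    """
    if granularity == "exact":
        return repr(value)
    if granularity == "rounded":
        return str(round(value))
    if granularity == "interval":
        lower = math.floor(value / step) * step
        return f"[{lower:g}, {lower + step:g})"
    raise ValueError(f"unknown granularity: {granularity}")
```

Two individual databases could call such a function with different `granularity` or `step` arguments, which is one way to picture database-specific data processing functions producing differently grained tokens.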
- a distributed system is understood to mean a plurality of interacting processes or a plurality of interacting processors executing processes which do not have a common memory, but rather communicate with one another via messages.
- A distributed system thus enables independent computers or individual databases to be merged, each of which is presented as a single system for the respective user. In the present case, the respective user sees a cross-database index, for example. Only when data records stored in a data model of another individual database are to be accessed directly can the user see that it is not a single system but rather a distributed system.
- Embodiments can have the advantage that the functionality of the complete index can be used on each of the nodes or on each of the individual databases without the corresponding nodes including all data records of the entire distributed database system. In this way, an overhead for the operation and storage of the data records for the remaining nodes or individual databases can be avoided on the respective nodes.
- Embodiments can have the advantage that a complete central storage of all data or a merging of all databases is not necessary at any location so that the entirety of the information of all indexes of all databases can be accessed and evaluated. This avoids a central instance collecting all data that could access all data. In particular, associated security risks can be avoided or minimized.
- Embodiments can have the advantage that data which are recorded by a node in the distributed network are linked directly to an owner or an identity.
- The received data records are each assigned directly to the receiving individual database, or to a user of the corresponding individual database, as the owner.
- Embodiments can have the advantage that they enable functionalities of a software component to be operated in a targeted manner on individual network nodes in a distributed system. In this way, the memory and computing requirements on the relevant node can be minimized while the desired functionality is operated. This also makes it possible for nodes that are relatively low-powered, for example in relation to other nodes in the distributed network, to carry out functions locally on site.
- Embodiments may have the advantage of providing an implementation of an edge computing system.
- Edge computing describes decentralized data processing at the edge of a network, the so-called edge.
- Edge computing is in contrast to cloud computing, in which computer applications, data and services are centralized, i.e. provided by central network nodes for use by a plurality of decentralized network nodes.
- edge computing is geared towards processing computer applications, data and services locally, if possible, or at least in the vicinity.
- This can have the advantage that data streams can be limited to local network areas in a way that conserves resources.
- Resources that are not permanently connected to a network can be used, in particular mobile computer systems, e.g. controllers, notebooks, smartphones, tablet computers and sensors.
- Different processes and structures can be used in edge computing, such as sensor networks, mobile data acquisition, mobile signature analysis and / or peer-to-peer and ad hoc networking.
- Edge computing can have the advantage that it can be used as an architecture concept for the Internet of Things (IoT), which links clearly identifiable physical objects, i.e. “things”, with a virtual representation or identity in a network structure, e.g. an “Internet”-like structure.
- Services that are provided in the course of edge computing can have the advantage that the data volume to be transmitted, and thus the data exchange as well as the transmission paths, can be significantly reduced, which means that transmission costs and waiting times can be reduced. As a result, the overall service quality can be improved.
- With edge computing, central data centers are not necessary, or at most only a small number of them is necessary, which means that bottleneck effects at such data centers for data transfer can be avoided, as well as the associated potential sources of error.
- the security of the data within the distributed system can also be increased. In particular, this enables potential security threats such as viruses, falsified data and / or hacker attacks to be recognized early and effectively averted.
- The ability to virtualize extends the scalability within the system, i.e. the number of edge devices in the network can be easily increased.
- With edge computing, real-time requirements in the Internet of Things can be better supported than is generally possible in the course of cloud computing, due to the faster data processing.
- In everyday life, users increasingly surround themselves with electronic devices that have computer functionality and connectivity.
- One trend is the increasing spread of wearables, i.e. computer systems which are arranged on the body of the user during use.
- Another trend is the replacement of everyday objects with “intelligent objects” (also called “smart devices”), which are equipped with information technology and are configured to process information themselves.
- intelligent objects are equipped with data processing hardware, such as a built-in microcontroller, communication interfaces and / or sensors, so that they can record, store and / or exchange data with one another.
- a plurality of data acquisition devices which are arranged in the distributed database system, acquire or generate data locally and transmit these data or supplementary data records based on them to one or more other nodes via a communication interface over the network.
- By reducing the software to its core functionalities, memory requirements and computing capacity can be saved so that the corresponding software can be used effectively and efficiently, for example in IoT sensors.
- Linking the data with an owner or with an identity increases the level of security of the data, as subsequent manipulation of the data by unauthorized persons can be excluded.
- Embodiments can have the advantage that they can provide an advantageous distributed database system for the Internet of Things.
- supplementing the index includes:
- Embodiments can have the advantage that data from additional data records can be efficiently inserted into the existing individual database and in particular into the index.
- The tokens generated using the additional data record are compared with the index. All tokens that the index does not (yet) include are added to the index as additional tokens. Furthermore, the additional tokens are linked with the pointer to the additional data record.
- the pointers can be used to determine which individual database contains the underlying data or data records.
- the pointer to the additional data record is added to the index.
- Embodiments can have the advantage that it can always be ensured that the index has or at least takes into account all tokens comprised by the data records in the database.
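The supplementing step described above (compare the generated tokens with the index, add the missing ones, link pointers to the additional data record) can be sketched as follows. This is only an illustration under stated assumptions: the index is modelled as a plain dictionary from token to a set of record pointers, and `tokenize` is a hypothetical, deliberately naive tokenizer that is not taken from the patent.

```python
def tokenize(record: dict) -> set[str]:
    """Hypothetical tokenizer: one lower-cased token per word of each field value."""
    return {word.lower() for value in record.values() for word in str(value).split()}


def supplement_index(index: dict[str, set[str]], pointer: str, record: dict) -> set[str]:
    """Supplement `index` in place with the tokens of one additional data record.

    Returns the additional tokens, i.e. those the index did not contain before.
    """
    tokens = tokenize(record)
    # Compare the generated tokens with the index: which ones are not yet included?
    additional_tokens = {token for token in tokens if token not in index}
    # Link every token of the record (new or already known) to the record's pointer.
    for token in tokens:
        index.setdefault(token, set()).add(pointer)
    return additional_tokens
```

The returned set of additional tokens, together with the pointer, is the kind of information a supplementary data record could carry to other individual databases without exposing the record itself.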
- The supplementary data set is sent from the source database via the network to a predefined first group of one or more individual databases of the plurality of individual databases as receive databases, the first group comprising the second individual database.
- Embodiments can have the advantage that it can be ensured for a predefined first group that its members each have the complete cross-database index. This can be advantageous, for example, in an asymmetrical system in which there is no need to provide a complete index for one or more of the individual databases. This can be the case, for example, if local searches are carried out on these databases only rarely or not at all.
- the corresponding individual database serves primarily as a data store and for parallelizing the creation of the index.
- The supplementary data set is sent from the source database via the network to all further individual databases of the plurality of individual databases as receive databases.
- Embodiments can have the advantage that a cross-database index can be implemented on all individual databases.
- a predefined second group of several individual databases of the plurality of individual databases is configured to function as a source database in each case when additional data records are received.
- all of the individual databases of the plurality of individual databases are configured to function as a source database in each case when additional data records are received.
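How a source database might pass its index changes to a group of receive databases can be sketched as below. The class `SingleDatabase`, the function `send_supplement` and the representation of pointers as `(database name, record id)` pairs are assumptions for illustration; network transport is replaced by a direct method call, and each field value is used as a single token for brevity.

```python
class SingleDatabase:
    """Hypothetical stand-in for one independent individual database."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.records: dict[str, dict] = {}  # first data model: local records only
        self.index: dict[str, set[tuple[str, str]]] = {}  # token -> pointers

    def merge(self, supplement: dict[str, set[tuple[str, str]]]) -> None:
        """Act as receive database: add the received tokens and pointers to the index."""
        for token, pointers in supplement.items():
            self.index.setdefault(token, set()).update(pointers)

    def add_record(self, record_id: str, record: dict) -> dict[str, set[tuple[str, str]]]:
        """Act as source database: store and index the record, return the supplement."""
        self.records[record_id] = record
        pointer = (self.name, record_id)
        supplement = {str(value).lower(): {pointer} for value in record.values()}
        self.merge(supplement)
        return supplement


def send_supplement(supplement: dict, receive_databases: list[SingleDatabase]) -> None:
    """Deliver the supplement to a predefined group (or to all other databases)."""
    for database in receive_databases:
        database.merge(supplement)
```

Passing only a subset of databases as `receive_databases` corresponds to the predefined first group; passing every other database corresponds to the fully replicated cross-database index. In both cases only tokens and pointers travel, so `records` on the receiving side stays untouched.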
- different degrees of standardization of the individual indexes of the individual databases can be achieved.
- a sensor is assigned to the source database and the additional data record comprises sensor data recorded by the sensor.
- the sensor data are indexed immediately after they have been recorded.
- each individual database of the plurality of individual databases is assigned one or more sensors, from which the respective individual database receives recorded sensor data in the form of additional data records.
- the data processing function further comprises normalizing the additional tokens in the course of adding to the index of the source database.
- the normalization fulfills the fifth and / or sixth normal form. Embodiments can have the advantage that redundancies can be avoided.
- the tokens can be stored in the form of relations or equivalent structures.
- a relation is a set of tuples.
- a tuple is a set of attribute values.
- An attribute describes a data type or a property assigned to one or more data. The number of attributes determines the degree, the number of tuples the cardinality of a relation.
- A normalization, in particular a normalization of a relational data model, is understood to be a division of attributes into a plurality of relations according to normalization rules, so that redundancies are reduced or minimized.
- a relational data model can be implemented, for example, in table-like data structures in which the relations are implemented in the form of tables, the attributes in the form of table columns and the tuples in the form of table rows.
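The definitions of relation, tuple, degree and cardinality can be made concrete with a small Python sketch. The class name `Relation` and the example attribute names are chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """A relation: named attributes (columns) plus a set of tuples (rows)."""

    attributes: tuple[str, ...]
    tuples: frozenset[tuple]

    @property
    def degree(self) -> int:
        # The number of attributes determines the degree.
        return len(self.attributes)

    @property
    def cardinality(self) -> int:
        # The number of tuples determines the cardinality.
        return len(self.tuples)

    def as_table(self) -> str:
        """Render the relation as a table: attributes as columns, tuples as rows."""
        rows = [self.attributes, *sorted(self.tuples)]
        return "\n".join(" | ".join(str(cell) for cell in row) for row in rows)


# Hypothetical token relation; the repeated tuple collapses because a relation is a set.
token_relation = Relation(
    attributes=("token", "database", "record"),
    tuples=frozenset({
        ("berlin", "A", "r1"),
        ("berlin", "B", "r9"),
        ("anna", "A", "r1"),
        ("anna", "A", "r1"),
    }),
)
```

Here `token_relation.degree` and `token_relation.cardinality` are both 3, since the duplicate row is not counted twice.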
- a relational data model can be brought into a normal form, for example, in that the relations of the data schema are progressively broken down into simpler relations based on the functional dependencies applicable to the corresponding normal form.
- Normal forms include: 1st normal form (1NF), 2nd normal form (2NF), 3rd normal form (3NF), Boyce-Codd normal form (BCNF), 4th normal form (4NF), 5th normal form (5NF), 6th normal form (6NF).
- The normalization criteria increase from normal form to normal form and each include the normalization criteria of the previous normal forms.
- a relation is in the first normal form if each attribute of the relation has an atomic range of values and the relation is free of repeating groups.
- The term atomic is understood here to mean an exclusion of composite, set-valued or nested value ranges for the attributes, i.e. relation-valued attribute value ranges. Freedom from repeating groups requires that attributes that contain the same or similar information are moved out into separate relations.
- A relation is in the second normal form if it fulfills the requirements of the first normal form and no non-prime attribute is functionally dependent on a proper subset of a candidate key.
- A non-prime attribute is an attribute that is not part of a candidate key. This means that each non-prime attribute is dependent on every candidate key as a whole and not just on part of a key.
- Relations in the first normal form whose candidate keys are not composite but each consist of a single attribute automatically satisfy the second normal form.
- A candidate key is understood here to be a minimal set of attributes that uniquely identifies the tuples of a relation.
- A relation is in the third normal form if it fulfills the requirements of the second normal form and no non-prime attribute depends transitively on a candidate key.
- An attribute is transitively dependent on a candidate key if the corresponding attribute is dependent on the corresponding candidate key via a further attribute.
- A relation is in Boyce-Codd normal form if it meets the requirements of the third normal form and every determinant is a superkey.
- A determinant is understood here to be a set of attributes on which other attributes are functionally dependent. A determinant thus describes the dependency between attributes of a relation and specifies which attribute sets determine the value of the other attributes.
- A superkey is a set of attributes in a relation which uniquely identifies the tuples in this relation. The attributes of this set therefore always have different values for any pair of distinct tuples. A candidate key is therefore a minimal subset of the attributes of a superkey which still enables the tuples to be identified.
- A relation is in the fourth normal form if it meets the requirements of the Boyce-Codd normal form and does not include any nontrivial multivalued dependencies.
- A relation is in the fifth normal form if it fulfills the requirements of the fourth normal form and does not include any multivalued dependencies that are interdependent.
- The fifth normal form is thus present if every nontrivial join dependency is implied by the candidate keys.
- A join dependency is implied by the candidate keys of the original relation if every relation of the set of relations is a superkey of the original relation.
- A relation is in the sixth normal form if it fulfills the requirements of the fifth normal form and does not include any nontrivial join dependencies.
- A relation satisfies a join dependency on a plurality of relations if the relation, as the original relation, can be decomposed without loss into the corresponding set of relations.
- The join dependency is trivial if one of the relations in the set of relations has all the attributes of the original relation.
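Whether a relation satisfies a join dependency, i.e. whether it can be decomposed without loss, can be checked by projecting it onto the component attribute sets and joining the projections again. The following Python sketch does this by brute force; the helper names and the sensor example data are assumptions for illustration, not part of the patent text.

```python
def relation(rows: list[dict]) -> set[frozenset]:
    """Build a relation as a set of rows; each row is a frozenset of (attribute, value)."""
    return {frozenset(row.items()) for row in rows}


def project(rel: set[frozenset], attributes: set[str]) -> set[frozenset]:
    """Projection onto `attributes`; duplicates vanish because the result is a set."""
    return {frozenset((a, v) for a, v in row if a in attributes) for row in rel}


def natural_join(left: set[frozenset], right: set[frozenset]) -> set[frozenset]:
    """Natural join: combine rows that agree on all shared attributes."""
    joined = set()
    for left_row in left:
        for right_row in right:
            merged = dict(left_row)
            if all(merged.get(a, v) == v for a, v in right_row):
                merged.update(right_row)
                joined.add(frozenset(merged.items()))
    return joined


def satisfies_join_dependency(rel: set[frozenset], components: list[set[str]]) -> bool:
    """True if joining the projections onto `components` reproduces `rel` exactly."""
    result = project(rel, components[0])
    for attributes in components[1:]:
        result = natural_join(result, project(rel, attributes))
    return result == rel


# Illustrative data: which sensor measures which quantity at which site.
readings = relation([
    {"sensor": "S1", "quantity": "temperature", "site": "Y"},
    {"sensor": "S1", "quantity": "humidity", "site": "X"},
    {"sensor": "S2", "quantity": "temperature", "site": "X"},
    {"sensor": "S1", "quantity": "temperature", "site": "X"},
])
```

For this example the three-way decomposition into (sensor, quantity), (quantity, site) and (site, sensor) is lossless, whereas dropping the third component produces a spurious row, so the two-way decomposition is not. A component list containing all attributes corresponds to the trivial join dependency.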
- the data processing function further comprises an assignment of rights for managing and / or processing the data of the additional data record.
- an owner can be assigned directly to the additional data record in this context. This owner can, for example, be given certain access rights at the same time, which enables him to manage and / or process the data.
- The first multi-model database management system assigns an owner right with respect to the additional data record and / or the supplementary data record to a first entity assigned to the source database. The corresponding entity is thus established as the owner of the additional data record and / or the supplementary data record.
- the data are linked directly to an owner or assigned to an owner by the node that acquires or generates the corresponding data.
- Authorization chains are formed with the recorded or generated data on the resource-restricted nodes of the distributed system, for example IoT sensors.
- the authorization chains define access rights for a number of entities.
- a central database with an authorization concept is not necessary on these systems, since the functionalities of the authorization chains are implemented directly on the individual nodes. As a result, data can be effectively and efficiently linked to an owner even on resource-constrained nodes.
- Linking the data directly during data generation on the individual resource-restricted nodes of the distributed system with an owner can have the advantage of increasing data security and guaranteeing transparency about the actual owners of the respective data.
- The first entity is a first user of the source database or a first computer system assigned to the source database. If it is an assigned computer system, e.g. the computer system comprising the source database, this can be advantageous, for example, in applications in the area of the Internet of Things, since each data record recorded by a corresponding sensor system is automatically assigned a unique owner in the course of being entered in the corresponding individual database.
- access authorization certificates are stored in the source database as part of the additional data record and / or the supplementary data record, the access authorization certificates comprising one or more of the following access authorization certificates:
- Access credentials can be designed as access certificates. According to embodiments, the access certificates are pure numerical values, not complex X.509 certificates. Metadata regarding the validity of the certificates and regarding other aspects can be stored separately from the actual access certificate in an additional ID database of an access management system of the corresponding individual database. Each data record preferably contains in its corresponding fields all access certificates of the user creating this data record or of the entity to which this data record is initially assigned.
- If, for example, a data record DS is created by user U1 and user U1 has exactly 3 types of certificates according to the content of the ID database (a read access certificate “U1.Z-Zert [R]”, a write access certificate “U1.Z-Zert [W]” and an index access certificate “U1.Z-Zert [S]”), copies of exactly these 3 access certificates are created from the ID database during the storage of the data record DS and saved in the corresponding fields of the data record DS.
- the user U1 can also be the corresponding individual database itself or an identity assigned to this individual database. For example, a secret key of an asymmetrical cryptographic key pair assigned to the corresponding individual database can serve as proof of authentication or proof of identity.
- the identity can also be based on technical characteristics, such as certain voltage or current patterns of the corresponding individual database or of a computer system comprising the corresponding individual database, comparable to biometric features in the case of living people.
- the individual database and / or components of the same have means for protection against unauthorized manipulation.
- the means for protection against unauthorized manipulation ensure the trustworthiness of the individual database or its control elements, such as processors, that is to say its function as a “trust anchor”, through technical measures.
- The individual database comprises components, or is implemented on components, which have been configured by a trustworthy institution, such as a trust center (trust service provider), and provided with the required cryptographic key material.
- the means for protection against unauthorized manipulation can ensure that security-relevant functionalities of the individual database or of the computer system comprising this individual database are not modified in an unauthorized manner.
- the means for protecting against unauthorized manipulation are designed as a so-called tamper proof module or trusted platform module (TPM), which is also referred to as a tamper resistant module (TRM).
- the TRM checks whether the signature or the signatures are valid. The test can also be based on the technical characteristics mentioned above. If one of the signatures is not valid, the TRM blocks the use of the individual database.
- A TPM includes, for example, microcontrollers according to the TCG specification as in ISO/IEC 11889, which add basic security functions to a computer or similar device.
- The means for protection against unauthorized manipulation contain mechanical means which, for example, are intended to prevent components and / or the entire computer system from being opened, or which, when an attempt is made to intervene, render the corresponding components unusable, for example through data loss and / or data blocking.
- Safety-critical parts can be cast in epoxy resin for this purpose, so that an attempt to remove a relevant component from the epoxy resin inevitably leads to the destruction of this component.
- the means for protecting against unauthorized manipulation can be designed as a so-called hardware security module (HSM).
- “Z.Zert_U1 [R]” of user U1 and the user certificate of user U2 are saved, thereby forming an authorization chain object. If the user U2 now wants to access the data record DS at a later point in time, the individual database containing the data record DS automatically sends an authorization request to the ID database of the access management system in response to the access request from user U2. This request contains the access rights of the creator U1 stored in the data record DS. The ID database then checks whether a user certificate assigned to the user U2 is stored in the ID database linked to one or more of the access rights of the creator U1 stored in the data record DS. Only if this is the case, and if the user U2 is also the owner of the source database, is he allowed to access the data record.
- the at least one access certificate which is preferably stored as part of the data record created, comprises several access certificates for other types of access.
- the multiple access certificates include, for example, a write access certificate Z.Zert_U1 [W] of the creating user and / or a read access certificate Z.Zert_U1 [R] of the creating user and / or an index access certificate Z.Zert_U1 [S] of the creating user.
- the access management system is a “standard” access management system that is usually installed on a standard computer system.
- A “standard” access management system can be MySQL, PostgreSQL, Oracle, SAP HANA, etc.
- other systems designed for the structured storage and retrieval of data are used as access management systems, for example microcontrollers, in whose memory the data records, certificates and authorization chain objects and the program logic described here can be stored.
- a user certificate is, for example, a certificate that is issued specifically for a particular person by a certification authority (CA).
- the user certificate is arranged in a testable manner in a certificate chain issued by a certification authority, so it can be tested for example up to the root certificate of the certification authority. This can be advantageous because certification bodies are already widely accepted as independent guarantors of trust and are already used by many existing technical systems to check the authenticity of certain users and user actions.
- a plurality of user certificates and / or a plurality of access certificates are stored in an ID database of the access management system.
- the ID databases each contain access authorization chain objects.
- An authorization chain object is a data object which contains one of the access certificates and one or more of the user certificates (and thereby assigns them to one another).
- the sequence of the user certificates in the access authorization chain object reproduces the sequence of users who have issued this access certificate for other users whose user certificate is contained in the access authorization chain object.
- An authorization hierarchy generated in this way comprises one of the access authorization chain objects.
- each of the individual databases comprises a corresponding ID database which manages the access authorizations to the data records and / or the index of the corresponding individual database.
- the individual databases or their data models are free of access authorization chain objects and contain only for each data record the access certificates that grant the user who created this data record or to whom this data record is assigned access to this data record.
- the linking of these access rights of the data set creator with one or more other users in order to grant this access to the data set is not stored in the individual databases or their data models, but in the ID database.
- the ID database does not contain any reference to individual data records in the individual database or its data models. This can be advantageous since the size and complexity of the individual data records in the individual databases or their data models are limited and logically largely decoupled from the administration of the access rights.
- the size of the data records is also limited by the fact that the complete chain of authorization transfers is not saved as part of the data records. In particular in the case of a plurality of small data records with an identical authorization structure, this can considerably reduce the storage space required by the database.
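The separation described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the classes `ChainObject` and `IdDatabase` and the user certificate names `N.Zert_U1` and `N.Zert_U2` are hypothetical. The data record carries only the creator's access certificates, while the ID database holds the authorization chain objects and contains no reference to any data record.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ChainObject:
    """Hypothetical authorization chain object: one access certificate plus
    the user certificates it was passed on to, in order of delegation."""
    access_cert: str
    user_certs: tuple


@dataclass
class IdDatabase:
    """Hypothetical ID database; it stores chain objects but no data records."""
    chains: list = field(default_factory=list)

    def grant(self, access_cert, *user_certs):
        self.chains.append(ChainObject(access_cert, tuple(user_certs)))

    def is_authorized(self, user_cert, record_access_certs):
        # True if the user certificate is linked to any of the access
        # certificates that are stored as part of the data record.
        return any(
            chain.access_cert in record_access_certs
            and user_cert in chain.user_certs
            for chain in self.chains
        )


# The data record only carries the access certificates of its creator U1.
record = {"payload": "DS", "access_certs": {"Z.Zert_U1[R]", "Z.Zert_U1[W]"}}

id_db = IdDatabase()
# U1 passes read access on to U2; the chain keeps the order U1 -> U2.
id_db.grant("Z.Zert_U1[R]", "N.Zert_U1", "N.Zert_U2")

print(id_db.is_authorized("N.Zert_U2", record["access_certs"]))
print(id_db.is_authorized("N.Zert_U3", record["access_certs"]))
```

Because the chain of delegations lives only in the ID database, many small data records with the same authorization structure can share it without repeating it in each record.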
- the ID database of the access management system contains a private signing key.
- the individual database contains a public signature verification key, which is designed to verify the signatures created with the signature key.
- the access management system determines whether the user certificate in the ID database is assigned one or more access authorization certificates, a copy of which is stored as part of the data record. If this is the case, this access authorization certificate is signed with the signing key of the ID database and checked with the corresponding signature verification key of the individual database.
- the credentials are sent to the individual database in signed form.
- the individual database uses the signature verification key to check whether the signature of the credentials is valid, the establishment of the database connection and data record access being permitted only if the signature is valid.
- a central ID database is provided for the entire distributed database system.
- This can be advantageous because not every individual database has to be equipped with program logic for the certificate chain check.
- the described method or the described database structure can be used to achieve an extensive separation of data management (in individual databases) and the management of access rights (in the ID database).
- the ID database contains and / or is operatively linked to one or more program modules, for example for certificate chain checking (for example from user certificates to the root certificate of the certification authority issuing the user certificate), for storing changes to the allocation of certificates and users in a log file and for the dynamic creation of signed credentials, taking into account a documented chronology of the transfer of rights.
- the individual databases only have, or are only operatively linked to, means for checking whether appropriate authorizations exist for the individual database itself or for the data records contained therein, these means possibly including means for signature checking.
- each of the data records in the source database contains one or more access certificates of the user creating the data record or of an entity to which the corresponding data record is assigned.
- the access certificates in the ID database are each assigned one or more user certificates in such a way that the chronological sequence of users who have granted one another one or more of the access rights of the user creating the data record is represented in the form of a hierarchy. This assignment can take place, e.g., by means of access authorization chain objects.
- an “access management system” is understood to mean an electronic system for storing and retrieving data.
- the access management system can be a “classic database management system” (DBMS) (MySQL, PostgreSQL, Oracle, Hana, etc.).
- the data are stored in a microcontroller memory and managed by an application program or a chip-based program logic that works as an access management system and is not a classic DBMS.
- the data are preferably stored consistently and permanently in the access management system and are made available efficiently to various application programs and users in a needs-based form.
- An access management system can typically contain one or more databases and manage the data records contained therein.
- an “ID database” is understood to mean a database which contains and manages user-related information such as user certificates and the rights assigned to these users in the form of further certificates (access certificates).
- an ID database is mainly used to manage the ownership and access rights assigned to users with regard to user data.
- At least one index access authorization certificate is stored as part of the additional data record and / or the supplementary data record for the first entity.
- a read access authorization, a write access authorization and an index access authorization are stored as part of the additional data record and / or the supplementary data record for the first entity.
- an index access credential for the additional data record is also assigned to a second entity assigned to the receive database.
- the second entity is a second user of the receive database or a second computer system assigned to the receive database.
- a prerequisite for sending the supplementary data record to the receiving database is a successful check that the second entity is assigned an index access authorization for the supplementary data record. It can thus be ensured that the receive database is actually authorized to access corresponding index information such as the supplementary data record.
- the data processing function also includes classifying the tokens generated from the additional data record.
- the source database comprises, for the classification, a pre-trained learning module for machine learning, the pre-trained learning module comprising a plurality of predetermined trigger definitions which define triggers for assigning tokens to classes of a first group of classes, wherein first tokens in the index of the source database, which are included as a trigger in one of the trigger definitions of the source database, are each assigned to the corresponding trigger definition, and second tokens in the index of the source database are assigned to one or more classes of the first group of classes.
- classifying comprises:
- the learning module identifies the corresponding token as a trigger
- the learning module uses the identified triggers to assign one or more second additional tokens to one or more classes of the first group of classes if the corresponding second additional tokens are included in the additional data record in combination with one or more of the identified triggers in accordance with one of the trigger definitions, the corresponding triggers triggering a corresponding class assignment in accordance with the corresponding trigger definition, the index being supplemented by the first multi-model database management system using the class assignments of the additional tokens.
- Embodiments can have the advantage that the learning module is a pre-trained learning module.
- the pre-trained learning module comprises a plurality of initially provided or defined trigger definitions.
- the learning module is configured to classify all tokens included in the database or the index using these initially defined trigger definitions.
- Embodiments can have the advantage that no randomness enters the decision-making or classification process. Rather, the classification of tokens is based on the predetermined trigger definitions and can therefore be traced at any time. Even if the learning module advances, for example, on the basis of the classification and learns further patterns and regularities in the course of a learning transfer based thereon, the underlying classification goes back to the predetermined trigger definitions.
- meta and / or context information on the classified tokens is provided in the form of the classification.
- This meta and / or context information is identified on the basis of the triggers in accordance with the trigger definitions and assigned to the corresponding tokens in the form of the class assignment.
- the learning module can be configured to learn further patterns and regularities using this meta and / or context information.
- Embodiments can have the advantage that the data records received from the source database are all stored in their original form in the document-oriented data model. This ensures that the full information content of these data records is retained.
- the data comprised by the data records stored in the document-oriented data model are provided in the form of the index.
- This index comprises the corresponding data of the document-oriented data model in the form of tokens.
- the index includes all of the elementary data elements included in the document-oriented data model in the form of elementary tokens.
- the index comprises data elements derived from the tokens.
- the index additionally comprises combinations of the elementary data elements in the form of token combinations comprised by the document-oriented data model. These token combinations each include a combination of a plurality of elementary tokens.
- the index comprises token combinations up to a predetermined complexity. The complexity of a token combination is defined, for example, by the number and / or type of elementary tokens it comprises.
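A minimal sketch of token combinations up to a predetermined complexity is given below. It assumes that complexity is measured only by the number of elementary tokens in a combination (the text also allows the type of token as a criterion); the function name `token_combinations` and the parameter `max_size` are hypothetical.

```python
from itertools import combinations


def token_combinations(tokens, max_size):
    """Return all combinations of 2..max_size distinct elementary tokens."""
    unique = sorted(set(tokens))  # sort so that the result is reproducible
    result = []
    for size in range(2, max_size + 1):
        result.extend(combinations(unique, size))
    return result


elementary = ["alice", "berlin", "1990"]
pairs = token_combinations(elementary, max_size=2)
up_to_triples = token_combinations(elementary, max_size=3)
print(len(pairs), len(up_to_triples))
```

The number of combinations grows quickly with the permitted size, which is one plausible reason for bounding the complexity in advance.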
- the tokens included in the index can be triggers according to the predetermined trigger definitions, for example. If a corresponding token is generated for the first time, for example in the course of processing an additional data record, it is identified as a trigger using one of the trigger definitions, added to the index and assigned to the corresponding trigger definition. If the learning module recognizes the same token, which the index defines as a trigger, within a further data record, the learning module accesses the trigger definition assigned to the trigger in the index and, in accordance with that trigger definition, assigns one or more tokens from the context of the trigger in the further data record to one or more classes of the first group of classes.
- the index comprises a plurality of tokens which are each assigned to one or more classes of the first group of classes.
- the assignment to the classes provides meta and / or context information on the corresponding tokens.
- the corresponding meta and / or context information can be used, for example, for processing the corresponding tokens and / or the data records comprising the corresponding tokens in the document-oriented data model.
- the corresponding meta and / or context information is taken into account in the course of a search query to identify relevant tokens and / or data records, or in the course of a further machine learning method that uses the index.
- additional patterns and regularities can be learned in the course of a further learning transfer.
- This further method for machine learning is carried out, for example, by the learning module using the trigger definitions or a further learning module.
- the further method for machine learning is an AI method that is executed by an AI module.
- remaining tokens that are assigned neither to a trigger definition nor to one or more classes of the first group of classes are assigned to a collection class in the index of the source database in order to identify the corresponding remaining tokens as unknown data.
- the assignment to the collection class excludes an assignment to one of the trigger definitions as well as an assignment to one of the classes of the first group of classes.
- classifying further comprises:
- the index also includes tokens that do not fall under any of the predetermined trigger definitions. These tokens are neither triggers nor can they be assigned to the classes defined by the trigger definitions. Rather, these tokens are unknown data which cannot be assigned and for which meta or context information is missing. These tokens are assigned to a collection class as unknown data. An assignment to the collection class excludes an assignment to one of the trigger definitions as well as an assignment to one of the classes of the first group of classes. Embodiments can have the advantage that, on the basis of the token assignments, it can be recognized in a simple form which tokens are unknown data and which tokens are known data, i.e. triggers or classifiable data.
- search queries can be defined in such a way that they only take into account known data.
- Additional learning algorithms can be configured, for example, in such a way that they work exclusively on known data.
- the use of chance in a decision-making or classification process can be avoided, even if additional learning algorithms are used.
- the basis for all learning processes and / or AI processes is provided by the initially defined triggers, which are used to classify the data received from the database.
- the predetermined trigger definitions offer a basis for supervised learning.
- Embodiments can furthermore have the advantage that additional data records that are added to the source database are each analyzed to determine which of the data they comprise is known data and which data is unknown data.
- known data is understood to mean data which are known as triggers, for which meta or context information is available and / or for which meta or context information can be derived from the context of the data records using the trigger definitions.
- Data that are neither triggers nor data that can be classified using the trigger definitions are unknown data. Unknown data is assigned to the collection class.
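The distinction between triggers, classifiable tokens and unknown data can be illustrated with a small deterministic sketch. It makes a strong simplifying assumption that is not taken from the text: a trigger definition here maps a trigger token to the class of the token that directly follows it. The names `TRIGGER_DEFINITIONS`, `COLLECTION_CLASS` and `classify` are hypothetical.

```python
COLLECTION_CLASS = "unknown"  # hypothetical label for the collection class

# Hypothetical trigger definitions: trigger token -> class that is assigned
# to the token directly following the trigger.
TRIGGER_DEFINITIONS = {"name": "Name", "born": "DateOfBirth"}


def classify(tokens):
    """Assign every token to a trigger definition, a class or the collection class."""
    index = {}
    previous = None
    for token in tokens:
        if token in TRIGGER_DEFINITIONS:
            # The token itself is a trigger.
            index[token] = ("trigger", token)
        elif previous in TRIGGER_DEFINITIONS:
            # The token follows a trigger and is classified by its definition.
            index[token] = ("class", TRIGGER_DEFINITIONS[previous])
        else:
            # Neither a trigger nor classifiable: unknown data, unless the
            # token was already classified by an earlier occurrence.
            index.setdefault(token, ("class", COLLECTION_CLASS))
        previous = token
    return index


result = classify(["name", "alice", "born", "1990", "likes", "tea"])
known = {t for t, (_, label) in result.items() if label != COLLECTION_CLASS}
print(result)
print(sorted(known))
```

No random initialization is involved, so the same input always yields the same assignments, and a search query can restrict itself to the `known` tokens.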
- Embodiments can have the advantage that a database system optimized for machine learning is used.
- the corresponding database system includes all of the data on which machine learning is based, i.e. the trigger definitions used to classify data as well as the data which are processed using the trigger definitions. This enables continuous learning by the learning module, taking into account all of the data seen by the system or the learning module.
- the source database saves all data records received in a document-oriented data model.
- a document-oriented data model means that the data model does not impose any structural requirements on the data to be saved. Rather, the data are stored in documents or data containers in the form in which they are received. In this sense, the data stored in the document-oriented data model are raw data.
- Raw data means that the data are stored in the form in which they are received, without additional data processing by the database management system, in particular no restructuring of the data.
- Embodiments can have the advantage that the entire information content of the received data can be (almost) completely retained without the assumptions of the database management system being included. Both the database management system and the learning module can access the original data at any time and take them into account in further processing.
- An index is generated based on this data pool of raw data provided by the document-based data model. Structural information or contexts of meaning are only extracted from the data sets at this level. This structural information or context is taken into account in the form of class assignments of the indexed data.
- the data records are broken down to an elementary level by tokenization, which takes into account the elementary components of the data records in the form of tokens.
- the tokens are assigned by the learning module as a trigger to one of the trigger definitions or classified using the trigger definitions. All tokens that are neither identified as triggers nor can be classified using one of the trigger definitions are assigned to the collection class as unknown data.
- the learning module includes a classifier and is configured to classify the tokens using the predetermined trigger definitions.
- the corresponding classification can, for example, be part of a pattern recognition in which a feature extraction is implemented by the tokenization. Based on this extraction of features, however, there is no feature reduction in the classic sense, since the complete database is indexed and thus each token is recognized as a trigger or assigned to a class, at least the collection class.
- each token in the index is linked to one or more pointers that indicate in which data records the corresponding token occurs. This means that the raw data relevant to a token can be accessed at any time and this raw data can be used for evaluation with regard to this token.
- Embodiments can have the advantage that the use of the structures and regularities determined by the learning module in the data records, which are reflected in the token assignments, is based on the use of the predetermined trigger definitions. Unknown data, on the other hand, are recorded as such and left out until they can also be classified and thus viewed as reliable facts.
- Such an additional classification can be implemented, for example, by additional trigger definitions.
- additional trigger definitions can be added to reduce the amount of tokens included in the collection class. The method thus enables learning and / or classification with reservations.
- Embodiments can therefore have the advantage that they allow the learning module to work on the entire available database.
- they can have the advantage of enabling continuous learning that takes into account both additional data records and already saved data records.
- Embodiments can therefore have the advantage that they are not restricted to an arbitrary subset being picked from an available total data set in order to train on it. Rather, all of the data contained in the database are processed using the trigger definitions.
- By adding to the trigger definitions according to embodiments, it can also be achieved that all tokens are either identified as triggers or classified using the (added) trigger definitions. If unknown data is excluded from search queries and / or further learning processes, this exclusion is not arbitrary, but based on the trigger definitions provided.
- Embodiments can have the advantage that no random initialization is required, as is the case with known self-learning systems, e.g. neural networks. Rather, the initialization is based on the predetermined trigger definitions. Because of the element of chance introduced by such a random initialization, the decisions / classifications of a corresponding neural network are not transparent and cannot be traced. In contrast, embodiments can have the advantage of being completely deterministic.
- Embodiments can have the advantage that an already trained system, i.e. the pre-trained learning module, is retrained or trained further.
- Trigger definitions can be added, removed or changed. In this way, for example, the classes used in the classification can be added, removed or changed. If trigger definitions are added, removed or changed, then all assignments of tokens based on these to the corresponding trigger definitions or to one of the classes must be adapted accordingly.
- the learning module implements an algorithm for machine learning, the method not being restricted to a specific algorithm.
- the machine learning algorithm comprises at least one classification algorithm for classifying tokens.
- Machine learning can be supervised or unsupervised learning.
- the machine learning can include a classification and / or a regression analysis.
- a learning algorithm tries to find a hypothesis or a mapping that assigns the (assumed) output value to each input value. If the output values to be assigned are present in a continuous distribution, the results of which can assume any quantitative values in a given range of values, then this is generally referred to as a regression problem. If, on the other hand, the output values to be assigned are in discrete form or if the values are qualitative, this is generally referred to as a classification problem.
- the machine learning is based on the classification of the indexed tokens.
- the learning module comprises an algorithm specially developed for machine learning, such as, without being limited thereto, a density-based multidimensional outlier detection ("local outlier detection"), a random forest algorithm, a neural network, a support vector machine, a naive Bayes classifier or a feedback similar to the feedback of a linear or non-linear controller.
- a multi-model database is understood here to mean a database which is configured to support a plurality of different data models.
- a multi-model database is thus configured to store, index and query data in more than one data model.
- Data models are, for example, relational, column-oriented, document-oriented, graph-based, key-value-based etc.
- a database model defines the structure in which data is stored in a database system, i.e. in which form the data is organized, stored and processed.
- a database is understood to be a (typically large) amount of data that is managed in a computer system by a database management system (DBMS) according to specific criteria. The data is organized in a large number of data sets.
- a database management system or DBMS is understood to mean an electronic system for storing and retrieving data.
- the data are preferably stored consistently and permanently in the DBMS and are made available efficiently to various application programs and users in a needs-based form.
- a DBMS can typically contain one or more databases and manage the data records contained therein.
- the DBMS can preferably be a field-oriented DBMS, that is to say a DBMS that is configured to store parts of individual data records, so-called field values, in several different fields.
- a data record is understood to be a coherent set of data provided to the database system, which is managed by the database management system as a coherent set of data.
- a data record comprises, for example, a set of content-related data.
- data sets are stored in the document-oriented data model as coherent data sets.
- a single data set can refer to a particular physical object, e.g. a natural person or a device. The person can e.g. be an employee, a patient, a customer, etc.
- the device can be, for example, a production device, a computer device, a computer or network element or a transport device.
- the corresponding data record can contain a predefined set of attribute values for this person or device (e.g. name or pseudonym, age, height, weight, date of birth, ID numbers, security certificates, authentication codes, biometric data, identifier, date of entry, commissioning date, configuration data, and others).
- a data record can represent a group of content-related data fields (belonging to an object), e.g. item number, item size, item color, item name or something similar.
- the classes 'name', 'address' and 'date of birth' could form the logical structure of a data record for the object type “person”.
- data is stored in the form of data records in databases, whereby they are the subject of the processing of computer programs and are generated, read, changed and deleted by these.
- a NoSQL (“Not only SQL”) DBMS is a DBMS that follows a non-relational approach to data storage and does not require any fixed table schemas.
- the NoSQL DBMSs include in particular document-oriented DBMSs such as Apache Jackrabbit, BaseX, CouchDB, IBM Notes, MongoDB; graph databases such as Neo4j, OrientDB, InfoGrid, HyperGraphDB, Core Data; ACID DBMSs such as MySQL Cluster; key-value databases such as Chordless, Google BigTable, GT.M, InterSystems Cache, Membase, Redis; sorted key-value stores; multivalue databases; object databases such as Db4o, ZODB; column-oriented databases; and temporal databases such as Cortex DB.
- An index is a data structure which accelerates a search for certain data values by a database management system.
- An index consists of a collection of pointers (references) that define an order relation to several “indexed” data values (stored in the index). For example, B+ trees are used for this.
- Each indexed data value is linked with further pointers which refer to the data records in which the indexed data value is contained and which formed the data basis for creating the index.
- Database management systems use indices to quickly identify the desired data records in response to a search query by first searching the index along the pointer for a data value which is identical to a reference value contained in the search query.
- without an index, the data values of a field managed by the DBMS would have to be searched sequentially, whereas a search using the index, for example a B+ tree, often has only logarithmic complexity.
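The difference between a sequential scan and an index lookup can be sketched as follows. As a simplifying assumption, a sorted list searched with binary search stands in for the B+ tree mentioned above; both give logarithmic lookup, but a real B+ tree is also designed for efficient updates and block-wise storage. The class `SortedIndex` is hypothetical.

```python
from bisect import bisect_left


class SortedIndex:
    """Indexed data values in sorted order, each with pointers to data records."""

    def __init__(self, records):
        postings = {}
        for record_id, values in records.items():
            for value in values:
                postings.setdefault(value, set()).add(record_id)
        self.keys = sorted(postings)  # defines the order relation
        self.pointers = [sorted(postings[key]) for key in self.keys]

    def lookup(self, value):
        # Binary search over the sorted keys: O(log n) comparisons.
        pos = bisect_left(self.keys, value)
        if pos < len(self.keys) and self.keys[pos] == value:
            return self.pointers[pos]
        return []


records = {1: ["alice", "berlin"], 2: ["bob", "berlin"], 3: ["carol", "paris"]}
index = SortedIndex(records)
print(index.lookup("berlin"))  # record ids that contain the value
print(index.lookup("rome"))    # value not indexed
```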
- the index also assigns the indexed data, i.e. tokens, to classes, as a result of which the corresponding data are linked with meta or context information.
- This meta or context information can be used in a search and / or in a machine learning process on the data in the database.
- a field is an area on a logical or physical data carrier which is managed by a DBMS, which is assigned to a predefined field type and which is created and intended for storing a field value of a data record.
- a field is therefore an element for storing a field value of a data record as defined above. Fields of a data record are managed jointly by a DBMS.
- a field value is a data value that is part of a data record and is stored in a field of the data record.
- a field value can consist of a single word, a single number, or a combination of several words and / or numbers and / or other data formats, with different embodiments of the invention offering varying degrees of flexibility with regard to the type and combinability of data types within the same field value.
- a “tokenizer” is a program logic that receives data, for example a field value, as input, analyzes the data, for example to identify delimiters or other decomposition criteria and patterns, and then breaks the data down into one or more tokens as the result of the analysis and returns the tokens. It is also possible that not all data will be returned as tokens. For example, a full text indexer can recognize semantically insignificant stop words and filter them out so that they are not indexed. Alternatively, all data is returned as tokens.
- To “tokenize” a data value means breaking the data value into several components according to a certain scheme. The components represent the tokens.
- natural-language texts can be divided up using predefined separators, such as spaces, periods or commas, and the components (words) generated in this way are used as tokens.
- all tokens are used for indexing. It is also possible that some tokens are not used for indexing (e.g. stop words) or that the tokens are additionally processed prior to indexing (e.g. reducing words to the stem).
- the search value is preferably processed in the same way by the client computer system or the server computer system to ensure that the search values of the search queries match the tokens contained in the index.
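A minimal tokenizer along these lines might look as follows. The delimiter set, the stop-word list and the lowercasing step are assumptions chosen for illustration, and the function name `tokenize` is hypothetical. The point of the sketch is that the search value passes through the same function as the indexed field values, so that both sides produce comparable tokens.

```python
import re

# Hypothetical stop-word list; a real full-text indexer would use a larger one.
STOP_WORDS = {"the", "a", "an", "of", "and"}


def tokenize(value, stop_words=STOP_WORDS):
    """Split a field value at whitespace and punctuation and drop stop words."""
    parts = re.split(r"[\s.,;]+", value.lower())
    return [part for part in parts if part and part not in stop_words]


# The field value and the search value are processed in the same way.
index_tokens = tokenize("The patient, Alice Smith.")
query_tokens = tokenize("ALICE")

print(index_tokens)
print(all(token in index_tokens for token in query_tokens))
```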
- a class defines a category or a type to which a token belongs.
- the class therefore assigns meta or context information to the token, for example in the form of a property.
- a class can represent a certain attribute of a physical object in the form of a token.
- for example, the data records to be saved contain employee attributes which represent classes such as “Name”, “Pseudonym”, “ID number”, “Access certificate for room R”, “Access certificate for device G”, “Access certificate for building GB” and “Age”.
- Each token can be assigned to one or more classes.
- combinations of tokens can in turn be assigned to one or more further classes as independent tokens.
- the data records received are stored using a document-oriented data model. For example, all field values of the stored data records are transferred as tokens into a multi-dimensional key-value store or key-value database.
- the tokens are assigned token types and stored in a form that meets the sixth normal form.
- the transaction time and the validity time of the data records are also stored bitemporally.
- the transaction time indicates the point in time at which a change to a data object in the database occurs.
- the validity time specifies a point in time or period in which a data object has the described state in the modeled image of the real world. If both the validity time and the transaction time are relevant, the data are referred to as bitemporal.
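The two time axes can be illustrated with a small sketch. The `Version` structure, the half-open validity interval and the rule that the most recently recorded matching version wins are assumptions made for this example, not details taken from the text.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Version:
    value: str
    valid_from: date   # validity time: start (inclusive)
    valid_to: date     # validity time: end (exclusive)
    recorded_at: date  # transaction time: when the database was changed


def as_of(versions, valid_on, known_on):
    """Value valid on `valid_on` according to what was recorded by `known_on`."""
    candidates = [
        v for v in versions
        if v.recorded_at <= known_on and v.valid_from <= valid_on < v.valid_to
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda v: v.recorded_at).value


END = date(9999, 12, 31)
history = [
    Version("Berlin", date(2020, 1, 1), END, recorded_at=date(2020, 1, 5)),
    # Recorded later: the move to Munich took effect on 2020-03-01.
    Version("Berlin", date(2020, 1, 1), date(2020, 3, 1), recorded_at=date(2020, 4, 1)),
    Version("Munich", date(2020, 3, 1), END, recorded_at=date(2020, 4, 1)),
]

# What the database believed on 2020-03-20 versus what it knew by 2020-04-02.
print(as_of(history, valid_on=date(2020, 3, 15), known_on=date(2020, 3, 20)))
print(as_of(history, valid_on=date(2020, 3, 15), known_on=date(2020, 4, 2)))
```

The same real-world date yields different answers depending on the transaction time, which is what distinguishes bitemporal storage from storing validity periods alone.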
- a key-value data model enables storage, retrieval and management of associative data fields. Values are uniquely identified using a key.
- a document-oriented data model is also known as a document store.
- documents or data containers form the basic unit for storing the data.
- a document-oriented data model enables document-oriented information, also known as semi-structured data, to be stored, retrieved and managed.
- Databases based on a document-oriented data model belong to the NoSQL databases and form a sub-class of key-value stores.
- in a key-value store, the data are viewed as inherently opaque to the database, while a document-oriented database relies on internal structures in the stored documents in order to extract metadata.
- the semi-structured data model is a database model in which there is no separation between the data and the schema and the scope of the structure used depends on the purpose of the database. Each document within the data model is addressed via a unique identifier.
- a combination of the different database concepts enables data records to be saved as documents or containers (document store) and additionally to be converted into the sixth normal form in the form of an index, e.g. a key-value memory.
- This key-value memory represents the entire data volume in the document memory, while the original data records are retained.
- selections are carried out exclusively in the key-value memory in the redundancy-free sixth normal form. Only the result is read from the document storage container.
- a selection right is also implemented on the key-value memory. This means that you can work on the index alone without having to read out the underlying data.
- the proposed multi-model database thus provides a complete normalization of the entire scope of data in the sixth normal form in addition to a schema-free data storage on the basis of a document memory.
- Embodiments can have the advantage that the index contains data elements of the data records, i.e. tokens, as keys, and each of these keys is assigned one or more pointers as values, which indicate in which data records and / or fields of the data records the corresponding key, i.e. the token or data value, is stored as a field value.
- This index therefore maps all fields of the data records and their contents, i.e. the field values, from the entire database with all of the data records it encompasses, so that all queries are handled in the index and the data of the document-oriented data model stored without a schema are only used to output the search results.
- the small size of the index compared to the schema-less data enables quick queries in any query combination.
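- the relationship between the document memory and the index described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the names `build_index` and `query`, the whitespace tokenization and the sample records are hypothetical.

```python
from collections import defaultdict

# Hypothetical schema-free document store: record id -> fields.
documents = {
    "r1": {"name": "Anna Schmidt", "city": "Berlin"},
    "r2": {"name": "Bernd Schmidt", "city": "Hamburg"},
    "r3": {"title": "Berlin report"},
}

def build_index(docs):
    """Map each token (key) to pointers of the form (record id, field name)."""
    index = defaultdict(set)
    for record_id, fields in docs.items():
        for field, value in fields.items():
            for token in str(value).split():  # assumed tokenization
                index[token].add((record_id, field))
    return index

def query(index, *tokens):
    """Resolve a conjunctive query on the index alone; return record ids."""
    hits = [{rid for rid, _ in index.get(t, set())} for t in tokens]
    return set.intersection(*hits) if hits else set()

index = build_index(documents)
matching_ids = query(index, "Schmidt", "Berlin")
# Only now is the document store read, in order to output the result.
results = [documents[rid] for rid in sorted(matching_ids)]
```

- in this sketch the selection runs entirely on the key-value structure; the document store is touched only in the last line.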
- a computer or computer system is understood here to mean a device which processes data by means of programmable arithmetic rules.
- a program or program instructions is understood here without restriction to mean any type of computer program which includes machine-readable instructions for controlling a functionality of a computer.
- a computer or computer system can comprise a communication interface for connection to the network, wherein the network can be a private or public network, in particular the Internet or another communication network. Depending on the embodiment, this connection can also be established via a cellular network.
- a computer system can be a stationary computer system, such as a personal computer (PC) or a client or server integrated in a client-server environment.
- a computer system can be, for example, a mobile telecommunication device, in particular a smartphone, a portable computer, such as a laptop PC or palmtop PC, a tablet PC, a personal digital assistant or the like.
- a corresponding computer system can also be an object of the Internet of Things (“smart device”), e.g. a so-called “wearable” or wearable computer, i.e. a portable electronic device or portable computer system which is arranged on the body of the user during use.
- examples of wearables are smartwatches, i.e. watches with computer functionality and connectivity, activity trackers, i.e. devices for recording and sending fitness and / or health-related data, smart glasses, i.e. glasses whose inner sides serve as a screen, or items of clothing in which electronic communication aids are incorporated.
- a computer system can be a computer system of a means of transport, such as a car, airplane, ship or train with an on-board computer.
- a computer system can be a control computer of a smart home system, an access point or router of a local WLAN network, a multimedia device with computer functionality and connectivity, such as a smart TV, a control system of a locking system, or a “smart device” or intelligent object, i.e. an everyday object equipped with information technology, which gains added value through sensor-based information processing and communication.
- a memory is understood here to mean both volatile and non-volatile electronic memories or digital storage media.
- a non-volatile memory is understood here to mean an electronic memory for the permanent storage of data.
- a non-volatile memory can be configured as a non-changeable memory, which is also referred to as read-only memory (ROM), or as a changeable memory, which is also referred to as non-volatile memory (NVM).
- it can be an EEPROM, for example a Flash EEPROM, referred to as Flash for short.
- a non-volatile memory is characterized by the fact that the data stored on it are retained even after the power supply has been switched off.
- a volatile electronic memory is a memory for the temporary storage of data, which is characterized in that all data is lost after the power supply is switched off.
- this can be a volatile random access memory, which is also referred to as a random access memory (RAM), or a volatile main memory of the processor.
- a processor is understood here and in the following to be a logic circuit that is used to execute program instructions.
- the logic circuit can be implemented on one or more discrete components, in particular on a chip.
- a processor is understood to mean a microprocessor or a microprocessor system made up of several processor cores and / or several microprocessors.
- the corresponding additional token is added to the index under its class assignments; if one of the class assignments of an additional token already included in the index is not included in the index, the corresponding class assignment is added to the index for that token; and the corresponding additional token is linked in the index with the pointer to the additional data record stored in the document-oriented data model.
- the corresponding class assignments are added.
- the pointer to the additional data record is added to the index for these tokens.
- Embodiments can have the advantage that it can always be ensured that the index has all tokens comprised by the data records of the corresponding individual database.
- the index includes all class assignments found for all corresponding tokens.
- each of the tokens of the index is linked with a pointer to all of the data records of the corresponding individual database which contain the corresponding token.
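- the update of the index for an additional data record can be sketched as follows. The index layout (token mapped to a set of classes and a set of pointers) and the name `add_record` are assumptions made for this illustration only.

```python
# Hypothetical index layout: token -> {"classes": set, "pointers": set}.
index = {
    "Berlin": {"classes": {"city"}, "pointers": {"r1"}},
}

def add_record(index, record_id, classified_tokens):
    """Add the tokens of an additional data record to the index.

    classified_tokens maps each token to the classes determined for it.
    """
    for token, classes in classified_tokens.items():
        entry = index.get(token)
        if entry is None:
            # Token not yet in the index: add it with an empty entry.
            entry = index[token] = {"classes": set(), "pointers": set()}
        # Add any class assignment that is still missing for this token.
        entry["classes"] |= set(classes)
        # Link the token with the pointer to the additional record.
        entry["pointers"].add(record_id)

add_record(index, "r2", {"Berlin": {"city", "surname"}, "GmbH": {"legal_form"}})
```

- after the call, every token of the additional record is present in the index with all class assignments found for it and a pointer to the record.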
- an initial set of predetermined trigger definitions is established.
- data records are received and stored in the document-based data model.
- the stored data records are tokenized and class assignments are determined for the resulting tokens using the initially defined trigger definitions and an initial index is generated for the resulting tokens.
- the initial index includes all triggers included in the trigger definitions as tokens.
- tokens determined as triggers by the trigger definitions are only added to the index on the condition that they are included in one of the data records.
- Assigning a token to a class by means of a predetermined trigger definition establishes a fact that is secured by the corresponding predetermined trigger definition. For tokens which are not a trigger and which are not covered by any of the trigger definitions, such factual knowledge is lacking. Instead, the corresponding tokens are assigned to the collection class as unknown data. Embodiments can thus have the advantage that, using initially established trigger definitions, new data can be divided into known data, i.e. triggers or tokens classified using trigger definitions, and unknown data, i.e. tokens assigned to the collection class.
- the combinations of second additional tokens with one or more of the identified triggers that have triggered a class assignment according to one of the trigger definitions are identified in the index as classified combinations and class assignments are only made for combinations of the second additional tokens and one or more identified triggers, which are not identified as classified combinations.
- Embodiments can have the advantage that all token combinations for which a class assignment has already been taken into account or carried out are each identified in the index as already classified. It can thus be avoided that the same classifications are carried out again for token combinations which the learning module has already seen and fully taken into account in the course of the classifications.
- the system can thus be designed to be significantly more efficient.
- the index includes all token combinations for which a classification has already taken place, ie all token combinations which are to be marked as classified.
- the corresponding token combinations in the index are each provided with a flag which indicates whether the corresponding token combination is a classified token combination.
- a comparison with all token combinations marked as already classified takes place first.
- the classification is not repeated for these token combinations; rather, there is only a link with the pointer to the additional data record.
- the corresponding pointer is also linked to all of the tokens included in the token combination in the index.
- the comparison first takes place with the largest, ie most extensive, token combinations of the index. For all token combinations of the additional data record already recognized as classified, only the pointer to the corresponding data record is stored in the source database. According to embodiments, the corresponding pointer is also linked to all of the tokens included in the token combination in the index.
- a comparison with further token combinations takes place successively, with the size or the scope of the further token combinations used gradually decreasing. According to embodiments, only those further token combinations with a smaller size or scope are taken into account for which no match was found, in the course of the comparison, as part of a larger or more extensive token combination.
- Embodiments can have the advantage that for extensive token combinations which are recognized as already classified, no additional comparison takes place for sub-combinations comprised by the corresponding token combination. Rather, a corresponding comparison only takes place if the corresponding sub-combination is included in the additional data record as an independent token combination, independent of the corresponding more extensive token combination.
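- the largest-first comparison can be sketched as follows. For simplicity the sketch assumes that token combinations are contiguous token sequences stored as tuples; the name `match_classified` and the sample data are hypothetical.

```python
def match_classified(tokens, classified):
    """Find already classified token combinations in a record, largest first.

    tokens: list of tokens of the additional data record.
    classified: set of tuples, each a contiguous token combination that is
                flagged in the index as already classified (assumption).
    Returns a list of (start position, combination) for every match.
    """
    covered = [False] * len(tokens)
    matches = []
    # Compare the most extensive combinations first.
    for combo in sorted(classified, key=len, reverse=True):
        n = len(combo)
        for start in range(len(tokens) - n + 1):
            span = range(start, start + n)
            if any(covered[i] for i in span):
                continue  # part of a larger match: no further comparison
            if tuple(tokens[start:start + n]) == combo:
                matches.append((start, combo))
                for i in span:
                    covered[i] = True
    return matches

tokens = ["Dr", "Anna", "Schmidt", "met", "Anna"]
classified = {("Dr", "Anna", "Schmidt"), ("Anna", "Schmidt"), ("Anna",)}
matches = match_classified(tokens, classified)
```

- here the sub-combinations inside the matched three-token combination are skipped, while the independent occurrence of “Anna” at the end of the record is still compared and matched.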
- the method further comprises:
- Embodiments can have the advantage that, based on the triggers identified by the initial trigger definitions, additional triggers can be identified in the form of trigger combinations. Based on these identified trigger combinations, combined trigger definitions can be determined from the initial trigger definitions, with which the plurality of predetermined trigger definitions can be expanded.
- token combinations which are included in the same data record and fall under the combined trigger definition are combined with one another and the resulting combination is identified in the index as a classified combination.
- Embodiments can have the advantage that on the basis of combined trigger definitions, token combinations in the index can be identified as classified combinations, thereby avoiding unnecessary repetitions of classifications of already classified token combinations.
- the combination criterion includes a minimum frequency for the corresponding trigger combination to occur in the data records.
- Embodiments can have the advantage that corresponding trigger combinations are only used to form a combined trigger definition when the corresponding trigger combination occurs in the data records with a minimum frequency. This prevents additional combined trigger definitions from being created due to the accidental occurrence of triggers with different trigger definitions in one and the same data record. Such a random occurrence is to be expected from a certain size and / or complexity of the data records onward, without it being possible to draw conclusions about an underlying relationship between the triggers. However, if the corresponding trigger combinations occur more frequently, a connection can be inferred from them.
- the minimum frequency defines an absolute frequency value of the occurrence in the data records.
- the corresponding minimum frequency can be a minimum value for the occurrence of the corresponding trigger combination in all data records. The occurrence of the corresponding trigger combination is added up across all data records. If the resulting sum is greater than or equal to the minimum value, the minimum frequency is fulfilled.
- the minimum frequency can be a minimum value for the occurrence in one of the data records. The occurrence of the corresponding trigger combination is summed up individually for the individual data records. If one of the resulting sums meets the minimum value, the minimum frequency is present.
- the minimum value must be fulfilled by a predetermined number of data records or a predetermined percentage of the data records.
- the corresponding predetermined percentage is either a percentage of all data records in the database or all data records which comprise the corresponding trigger combination.
- the minimum value must be fulfilled by all data records and / or by all data records that include the corresponding trigger combination.
- the corresponding minimum frequency can be a minimum value for an average frequency of occurrence of the corresponding trigger combination in all data records in the database or in all data records which comprise the corresponding trigger combination.
- the minimum frequency defines a relative frequency value of the occurrence in the data records.
- the corresponding minimum frequency is dependent on the number of data records and / or the number of tokens and / or the size of the data comprised by the data records. For example, the frequency value specified by the minimum frequency grows with the number of data records and / or the number of tokens and / or the size of the data comprised by the data records.
- the minimum frequency stipulates a relative frequency value of the occurrence in the data records relative to the frequencies of occurrence of one or more of the triggers comprised by the corresponding trigger combination in the data records.
- the relative frequency value is dependent on the occurrence of the trigger with the highest frequency of an occurrence, the trigger with the lowest frequency of an occurrence and / or an average value of the occurrence of all triggers of the corresponding trigger combination.
- Embodiments can have the advantage that, when a relative frequency value is taken into account, the frequency of occurrence of one or more of the triggers comprised by the corresponding trigger combination is included in the decision as to whether an additional combined trigger definition based on the corresponding trigger combination is to be supplemented.
- the frequency of occurrence of the corresponding triggers can relate to an occurrence of the corresponding triggers in all data records, to an average occurrence in all data records, to a most frequent occurrence in one of the data records and / or to a minimal occurrence in one of the data records.
- Embodiments can have the advantage that the relative frequency value is selected to be higher, the higher the frequencies of occurrence of the one or more corresponding triggers comprised by the trigger combination. It can thus be avoided that a trigger definition is generated on the basis of a trigger combination, the occurrence of which is random, i.e. whose triggers happen to be included in the same data set, without this indicating a connection between the corresponding triggers.
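- an absolute and a relative minimum frequency can be sketched as follows. The way combination occurrences are counted per record (minimum of the individual trigger counts) and the function names are simplifying assumptions for this illustration.

```python
def combination_counts(records, combo):
    """Count per record how often all triggers of combo occur together.

    A record contributes min(count of each trigger), a simplifying
    assumption for "occurrences of the combination".
    """
    return [min(rec.count(t) for t in combo) for rec in records]

def meets_absolute(records, combo, minimum):
    """Absolute criterion: occurrences summed over all data records."""
    return sum(combination_counts(records, combo)) >= minimum

def meets_relative(records, combo, ratio):
    """Relative criterion: combination frequency compared with the
    frequency of its most frequent trigger across all data records."""
    total = sum(combination_counts(records, combo))
    most_frequent = max(sum(rec.count(t) for rec in records) for t in combo)
    return most_frequent > 0 and total / most_frequent >= ratio

records = [
    ["Dear", "Mr", "Smith", "Dear", "Mr", "Jones"],
    ["Dear", "all"],
    ["Mr", "Brown", "Dear", "Mr", "Gray"],
]
combo = ("Dear", "Mr")
total = sum(combination_counts(records, combo))
```

- with a relative criterion, a combination of very frequent triggers must itself occur more often before a combined trigger definition is derived from it.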
- the combination criterion includes one or more conditions on the relative positions of the triggers of the corresponding trigger combination to one another within one of the data records.
- Embodiments can have the advantage that a relative position of the triggers of the corresponding trigger combination within the data record is taken into account for the combination criterion.
- a corresponding relative position of data within data records results from or is dependent on contextual relationships. Corresponding contextual relationships can therefore be read from the relative position.
- the relative position can be a relative position in a one-dimensional, i.e. sequential, data structure such as a text or speech file, a two-dimensional data structure such as an image file, or a higher-dimensional, for example three-dimensional or n-dimensional, data structure.
- the trigger definitions each comprise a definition of a trigger structure which specifies positions relative to one another for the one or more triggers included in the corresponding trigger definition and the one or more tokens to be assigned to one of the classes according to the corresponding trigger definition.
- Embodiments can have the advantage that a corresponding trigger definition using one or more triggers defines how one or more tokens are to be classified as a function of a relative position of the corresponding tokens to the corresponding triggers.
- the corresponding relative position can be a relative position in a one-dimensional, two-dimensional or higher-dimensional, for example three-dimensional or n-dimensional, data space.
- the stipulations of the relative positions include at least one of the following stipulations: the one or more tokens to be assigned are arranged after a trigger included in the corresponding trigger definition; the one or more tokens to be assigned are arranged before one of the triggers included in the corresponding trigger definition; the one or more tokens to be assigned are each arranged between triggers included in the corresponding trigger definition.
- a trigger can, for example, trigger a classification of preceding data, e.g. “[before1] [Trigger1]”.
- the occurrence of the trigger “Trigger1” triggers a classification of the preceding data “before1”.
- the trigger itself is part of the classification, i.e. the combination “[before1] [Trigger1]” is classified.
- the trigger “Trigger1”, if it is recognized, is assigned as a trigger to the corresponding trigger definition.
- a trigger can, for example, trigger a classification of subsequent data, e.g. “[Trigger2] [after1]”.
- the occurrence of the trigger “Trigger2” triggers a classification of the subsequent data “after1”.
- the trigger itself is part of the classification, i.e. the combination “[Trigger2] [after1]” is classified.
- the trigger “Trigger2”, if it is recognized, is assigned as a trigger to the corresponding trigger definition.
- a trigger can, for example, trigger a classification of preceding and following data, e.g. “[before2] [Trigger3] [after2]”.
- the occurrence of the trigger “Trigger3” triggers a classification of the preceding data “before2” and the following data “after2”.
- the trigger itself is part of the classification, i.e. the combination “[before2] [Trigger3] [after2]” is classified.
- the trigger “Trigger3”, if it is recognized, is assigned as a trigger to the corresponding trigger definition.
- a combination of two or more triggers can, for example, trigger a classification of preceding data, following data and data arranged between the triggers, e.g. “[before3] [Trigger4] [between1] [Trigger5] [after3]”.
- the combination of the triggers “Trigger4” and “Trigger5” triggers a classification of the preceding data “before3”, the following data “after3” and the data in between “between1”.
- the triggers themselves are part of the classification, i.e. the entire combination “[before3] [Trigger4] [between1] [Trigger5] [after3]” is classified.
- the triggers “Trigger4” and “Trigger5”, if they are recognized, are assigned as triggers to the corresponding trigger definition.
- a trigger combination can comprise any number of triggers, e.g. “[before4] [Trigger6] [between2] [Trigger7] ... [between2+N] [Trigger6+(N+1)] [after4]”.
- a trigger combination of triggers “Trigger6” to “Trigger6+(N+1)” triggers a classification of the preceding data “before4”, the following data “after4” and the data in between “between2” to “between2+N”.
- the triggers themselves are part of the classification, i.e. the entire combination “[before4] [Trigger6] [between2] [Trigger7] ... [between2+N] [Trigger6+(N+1)] [after4]” is classified.
- the formulation “May over” is a first trigger [Trigger1] and the formulation “and” is a second trigger [Trigger2].
- the structure thus corresponds to a structure of the form [before] [Trigger1] [between] [Trigger2] [after].
- the preceding data [before] are classified as an identity, and the data in between [between] and the subsequent data [after] are each classified as identities.
- the formulation "The customer bears the damage" is a trigger.
- the structure therefore corresponds to the structure [trigger] [after].
- the subsequent data [after] is classified as a condition.
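- the trigger structures described above, i.e. data before, between and after one or more triggers, can be sketched for a sequential token list as follows. The function name `split_by_triggers` and the returned dictionary layout are assumptions made for this illustration.

```python
def split_by_triggers(tokens, triggers):
    """Split a token sequence into segments around the given triggers.

    triggers: non-empty list of trigger tokens in the order in which
              they must appear in the sequence.
    Returns {"before": [...], "between": [[...], ...], "after": [...]}
    or None if not all triggers occur in that order.
    """
    positions = []
    start = 0
    for trig in triggers:
        try:
            pos = tokens.index(trig, start)
        except ValueError:
            return None  # trigger structure not present in this record
        positions.append(pos)
        start = pos + 1
    between = [tokens[a + 1:b] for a, b in zip(positions, positions[1:])]
    return {
        "before": tokens[:positions[0]],
        "between": between,
        "after": tokens[positions[-1] + 1:],
    }

tokens = ["before3", "Trigger4", "between1", "Trigger5", "after3"]
parts = split_by_triggers(tokens, ["Trigger4", "Trigger5"])
single = split_by_triggers(["Trigger2", "after1"], ["Trigger2"])
```

- each returned segment can then be assigned to the class stipulated by the corresponding trigger definition, e.g. the “after” segment to a class such as “condition”.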
- a trigger definition can stipulate that a token which is located within a radius around a specific trigger in an n-dimensional data space is to be assigned to a specific class.
- the spatial direction in which the token is offset from the trigger can also be decisive for the class assignment. This can for example be defined by a vector which defines the relative position of the token to the trigger.
- a trigger definition can stipulate that a token which is arranged within a plurality of radii around one trigger of a plurality of triggers is to be assigned to a specific class.
- n-dimensional areas delimited by the individual radii overlap and delimit an n-dimensional or lower-dimensional intersection area in the n-dimensional data space.
- a token which is part of this n-dimensional or lower-dimensional intersection area is assigned to a certain class, for example.
- a maximum trigger distance is defined for the triggers in accordance with the trigger definitions, which defines a maximum distance relative to the corresponding trigger to which a trigger effect of the trigger is limited.
- Embodiments can have the advantage that the corresponding maximum distance is a radius around the corresponding trigger in an n-dimensional data space.
- the trigger effect is limited to the corresponding maximum trigger distance in front of and behind the corresponding trigger.
- the trigger effect is limited to a two-dimensional circular area around the corresponding trigger.
- the trigger effect is limited to a spherical volume around the corresponding trigger.
- the trigger effect is limited to a volume of an n-dimensional sphere around the corresponding trigger.
- the maximum distance can depend on the spatial direction and be set to be of different sizes in different spatial directions.
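- the radius-based conditions can be sketched as follows, assuming that tokens and triggers have coordinates in an n-dimensional data space. Modelling the direction-dependent limit as an axis-aligned ellipsoid is an assumption of this sketch, as are the function names.

```python
import math

def within_radius(token_pos, trigger_pos, radius):
    """True if the token lies within the n-dimensional sphere around the trigger."""
    return math.dist(token_pos, trigger_pos) <= radius

def within_all(token_pos, triggers):
    """True if the token lies in the intersection of all trigger spheres.

    triggers: iterable of (position, radius) pairs.
    """
    return all(within_radius(token_pos, pos, r) for pos, r in triggers)

def within_directional(token_pos, trigger_pos, radii):
    """Direction-dependent limit, modelled here as an axis-aligned ellipsoid
    with one radius per dimension (an assumption for this sketch)."""
    return sum(((t - c) / r) ** 2
               for t, c, r in zip(token_pos, trigger_pos, radii)) <= 1.0

two_triggers = [((0, 0), 1.5), ((2, 0), 1.5)]
in_intersection = within_all((1, 0), two_triggers)
```

- the same functions work unchanged for one-dimensional, two-dimensional or higher-dimensional coordinates, since the dimension is given by the length of the position tuples.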
- the maximum trigger spacing is identical for all triggers. According to embodiments, the maximum trigger spacing is identical for a subset of the triggers. According to embodiments, the maximum trigger spacing is determined individually for each trigger.
- the corresponding maximum trigger distance can be a distance in a certain unit, depending on the type of data. For example, in the case of a temporal sequence, the distance is a time interval measured in a unit of time, such as milliseconds, seconds or minutes.
- in the case of a one-dimensional, two-dimensional or three-dimensional spatial data structure, the distance is a spatial distance in a spatial unit, such as millimeters, centimeters, decimeters or meters.
- the distance can be based on pixels or voxels, for example. A corresponding distance can thus be, for example, a number of pixels or a number of voxels.
- the distance is a logical distance. This can for example be based on elementary data elements, such as elementary characters.
- a corresponding distance can be a number of characters, for example.
- the corresponding distance can be a number of data elements composed of elementary data elements, such as a number of words. For example, the count is limited to words of a certain part of speech.
- the distance can be limited by logical elements in the data structure, such as a punctuation mark and / or a trigger.
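- a logical maximum trigger distance in a sequential data structure can be sketched as follows. The distance is counted in tokens and the trigger effect additionally ends at a punctuation mark; the punctuation set and the name `tokens_in_range` are assumptions of this sketch.

```python
PUNCTUATION = {".", ",", ";", ":", "!", "?"}

def tokens_in_range(tokens, trigger_index, max_distance):
    """Return the tokens within max_distance positions of the trigger.

    The trigger effect stops early at a punctuation mark on either side.
    """
    affected = []
    for step in (-1, 1):  # look before, then after the trigger
        i = trigger_index + step
        travelled = 1
        while 0 <= i < len(tokens) and travelled <= max_distance:
            if tokens[i] in PUNCTUATION:
                break  # logical boundary limits the trigger effect
            affected.append(tokens[i])
            i += step
            travelled += 1
    return affected

tokens = ["Pay", "to", "Anna", "Schmidt", ".", "Thanks"]
wide = tokens_in_range(tokens, 1, 3)
narrow = tokens_in_range(tokens, 1, 1)
```

- in the example the trigger “to” at position 1 does not reach “Thanks”, because the full stop limits the trigger effect before the maximum distance of three tokens is exhausted.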
- the method further comprises:
- the reclassification by the learning module comprises replacing the assignment to the collection class with an assignment to the additional trigger definition which includes the corresponding token as an additional trigger.
- the learning module uses the additional triggers to reclassify one or more tokens assigned to the collection class in the index to one or more classes of the second group of classes if the corresponding tokens assigned to the collection class are included in one of the data records in a combination with one or more of the additional triggers and the corresponding additional triggers, according to the corresponding additional trigger definition, trigger a corresponding assignment to the one or more classes of the second group of classes.
- Embodiments can have the advantage that by adding additional trigger definitions to the learning module, the number of tokens that are assigned to the collection class can be reduced. Additional trigger definitions can be supplemented in a targeted manner in order to reclassify those tokens that are assigned to the collection class. Additional trigger definitions can therefore be supplemented as a function of the data records which the corresponding individual database comprises and the unknown data which they comprise.
- additional trigger definitions are added until all tokens of the collection class are reclassified.
- corresponding additional trigger definitions are added according to predefined intervals.
- Corresponding predefined intervals are defined, for example, in terms of time, the number of tokens included in the collection class, the amount of data stored in the corresponding individual database and / or the amount of data added to the corresponding individual database since the last addition.
- the second group comprises classes different from the classes in the first group.
- Embodiments can have the advantage that additional classes are defined so that those tokens of the collection class can be classified for which the meta or context information corresponding to the classes of the first group cannot be used. Rather, additional meta or context information can be defined and used by the classes of the second group.
- one or more classes of the second group are each identical to one of the classes of the first group.
- Embodiments can have the advantage that additional trigger definitions are provided which enable an assignment of the tokens of the collection class to classes of the first group of classes.
- the trigger definitions to be supplemented are each dependent, as supplements, on a trigger definition already included in the learning module.
- Embodiments can have the advantage that one or more of the supplementary trigger definitions are defined in the form of additions to the trigger definitions already included in the learning module.
- the corresponding supplementary trigger definitions extend, for example, the trigger effect of already existing trigger definitions.
- the supplementary trigger definitions form combined trigger definitions with the already existing trigger definitions.
- the additions are carried out repeatedly following a recursive scheme, the trigger definitions to be added at each recursion level each comprising additions to a trigger definition of a preceding recursion level, so that the recursive additions form tree structures which each include one of the predetermined trigger definitions as root node.
- Embodiments can have the advantage that the trigger effect of the existing trigger definitions can be successively expanded by a progressive recursion scheme until all tokens of the collection class are reclassified.
- the result of the corresponding additions to the already existing trigger definitions can be tree structures, for example, which can be followed when classifying tokens.
- the additional trigger definitions to be supplemented are received by the learning module.
- Embodiments can have the advantage that the corresponding trigger definitions can, for example, be provided externally, for example by an administrator. A corresponding administrator thus always has the opportunity to control, correct and supplement the classification.
- an external fine adjustment can optionally take place, for example by an administrator.
- using additional trigger definitions, tokens are extracted from the class of the unknown data, i.e. the collection class, and assigned to existing classes, and / or new classes are generated to which extracted tokens are assigned.
- an administrator provides additional trigger definitions that are applied to the collection class, analogous to the initial trigger definitions provided.
- the additional triggers are applied exclusively to the collection class and to data received in the future in accordance with the additional trigger definitions.
- the use of an additional trigger can be implemented as an IF condition. For example, if another trigger, e.g. a trigger1, has already been successfully applied to a data record and the data record also includes data classified as unknown, then an additional trigger, e.g. a trigger2, is applied according to one of the additional trigger definitions.
- This fine adjustment can be repeated several times as a recursion. For example, the recursion is continued until the collection class no longer includes any tokens, i.e. no more unknown data exists, or the number of tokens comprised by the collection class reaches and / or falls below a predefined threshold value, i.e. a predefined maximum number.
- the corresponding threshold value can be an absolute value which is independent of the number of tokens included in the index and the amount of data included in the corresponding individual database.
- the corresponding threshold value can be a relative value which is dependent on the number of tokens included in the index and / or the data volume included in the corresponding individual database.
- trigger trees or decision trees can arise starting from the initially defined triggers or trigger definitions, the number of levels depending on the number of recursions N, e.g. the number of levels is equal to N + 1.
- each initial trigger or each initial trigger definition forms a root node of a corresponding trigger tree or decision tree.
- a decision tree is understood here to mean an ordered, directed tree that serves to represent decision rules.
- if a data record includes an initial trigger by means of which part of the tokens in the data record can be classified, without all tokens in the data record being classifiable at the same time, it is checked whether the data record also includes a trigger of the first recursion. If the data record includes a trigger of the first recursion by means of which a further part of the tokens of the data record can be classified, without all data of the data record being classifiable at the same time, it is checked whether the data record also includes a trigger of the second recursion, and so on.
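- the level-by-level check can be sketched as follows. The sketch assumes a deliberately simple trigger structure in which each trigger assigns a class to the token that directly follows it; the name `classify`, the label "trigger" and the sample data are hypothetical.

```python
COLLECTION = "collection"  # class for unknown data

def classify(tokens, levels, threshold=0):
    """Apply trigger definitions level by level.

    levels: list of dicts, index 0 holding the initial trigger definitions
            and each further entry one recursion level; every dict maps a
            trigger token to the class of the token that follows it.
    Stops when the collection class holds at most `threshold` tokens or
    when a level has no trigger in the record.
    Returns (assignments, levels_used).
    """
    assignments = [COLLECTION] * len(tokens)
    levels_used = 0
    for definitions in levels:
        if assignments.count(COLLECTION) <= threshold:
            break  # collection class is empty or below the threshold
        hits = [i for i, tok in enumerate(tokens) if tok in definitions]
        if not hits:
            break  # no trigger of this level in the record
        levels_used += 1
        for i in hits:
            assignments[i] = "trigger"
            if i + 1 < len(tokens) and assignments[i + 1] == COLLECTION:
                assignments[i + 1] = definitions[tokens[i]]
    return assignments, levels_used

tokens = ["Name:", "Anna", "born", "1990", "in", "Berlin"]
levels = [{"Name:": "person"}, {"born": "year"}, {"in": "city"}]
full, used_full = classify(tokens, levels)
partial, used_partial = classify(tokens, levels, threshold=2)
```

- with one initial level and N = 2 recursion levels the sketch runs through N + 1 = 3 levels; with a threshold of two remaining unknown tokens it stops after the first recursion.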
- the additional trigger definitions to be supplemented are created by the learning module, which comprises a statistical model. The statistical model is used for a statistical analysis of the tokens included in the collection class and their occurrence in the data records, and the result of the statistical analysis is used to create the additional trigger definitions to be supplemented.
- Embodiments can have the advantage that the learning module can independently create supplementary additional trigger definitions. For example, the optional fine adjustment described above takes place using the statistical model.
- the statistical model identifies triggers within the unknown data, for example by frequency analyses and correlation analyses, which are then applied to the tokens classified as unknown in a manner analogous to the procedure described above.
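- a very simple frequency analysis of this kind can be sketched as follows. It is only a stand-in for the statistical model, which the text does not specify further: tokens that frequently precede unknown tokens are proposed as trigger candidates. The name `propose_triggers` and the parameter `min_count` are assumptions.

```python
from collections import Counter

def propose_triggers(records, unknown, min_count=2):
    """Propose trigger candidates by a simple frequency analysis.

    For every known token that directly precedes an unknown token, count
    how often it does so; frequent predecessors become candidates.
    """
    predecessors = Counter()
    for tokens in records:
        for prev, cur in zip(tokens, tokens[1:]):
            if cur in unknown and prev not in unknown:
                predecessors[prev] += 1
    return [tok for tok, n in predecessors.most_common() if n >= min_count]

records = [
    ["IBAN", "DE01", "amount", "10"],
    ["IBAN", "DE02", "note", "x"],
    ["IBAN", "DE03"],
]
unknown = {"DE01", "DE02", "DE03", "x"}
candidates = propose_triggers(records, unknown)
```

- candidates found in this way could then be turned into additional trigger definitions and applied to the collection class, either automatically or after review by an administrator.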
- Embodiments can also take a recursive approach using the statistical model.
- the method further comprises:
- an administrator can recognize errors in classified classes and correct them if necessary, for example by providing a corrected trigger definition, on the basis of which tokens are reclassified.
- Embodiments can have the advantage that a correction of trigger definitions is made possible at any point in time during the method. For example, the trigger definitions can be checked after training the learning module. If trigger definitions in need of correction are identified, correspondingly corrected trigger definitions are provided.
- Embodiments can have the advantage that corrected trigger definitions can also be provided at a later point in time when incorrect classifications are recognized. Administrative intervention in the learning and classification process is therefore possible at any time. This means that errors in the learning system can be corrected without having to convert the entire model.
- the pointers with which the tokens are stored linked in the index each refer to one or more of the field values in the stored data records.
- Embodiments can have the advantage that a finer granularity can be achieved when determining the origin of tokens in the data records. Such a finer granularity also enables the relative relationships of the tokens within the data records to be broken down and taken into account in an analysis or other use of the index.
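Field-level pointers can be illustrated with a small sketch. It assumes a pointer is a pair of record identifier and field name and that field values are split on whitespace; `Pointer` and `index_record` are hypothetical names.

```python
from collections import defaultdict
from typing import NamedTuple


class Pointer(NamedTuple):
    """Hypothetical pointer that refers to one field of one stored data record."""
    record_id: str
    field: str


def index_record(index, record_id, record):
    """Link every token of every field value to the field it came from."""
    for field, value in record.items():
        for token in str(value).split():
            index[token].add(Pointer(record_id, field))


index = defaultdict(set)
index_record(index, "DS1", {"F1": "pump failed", "F2": "pump"})
```

With pointers of this shape, a lookup for `pump` tells not only that the token occurs in `DS1` but also that it occurs in both `F1` and `F2`.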
- the field values of the additional data record include text data, image data, audio data and / or video data.
- the method can be used, for example, for signal processing, such as 1D audio recognition, 2D and 3D image processing, or ND data input from N sensors, etc.
- the method can be used, for example, for an analysis of stream data (bitstream or bit stream).
- a bit stream, also written bitstream, here refers to a sequence of bits that represents a flow of information, i.e. a serial or sequential signal.
- a bit stream is thus a sequence of bits of indefinite length in chronological order.
- a bit stream represents, for example, a data stream divided into logical structures, which is divided into more fundamental small structures such as symbols of a fixed size, i.e. bits and bytes, and can be further broken down into blocks and data packets of different protocols and formats.
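The breakdown of a bit stream into fixed-size symbols and blocks can be sketched as follows. The bit order (most significant bit first) and the handling of an incomplete trailing byte are assumptions made for this illustration.

```python
def bits_to_bytes(bits):
    """Group an iterable of 0/1 bits into bytes, most significant bit first.

    An incomplete trailing group of fewer than 8 bits is dropped.
    """
    out = bytearray()
    byte, n = 0, 0
    for bit in bits:
        byte = (byte << 1) | (bit & 1)
        n += 1
        if n == 8:
            out.append(byte)
            byte, n = 0, 0
    return bytes(out)


def blocks(data, size):
    """Split a byte string into consecutive blocks of at most `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]
```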
- generating the tokens includes applying tokenization logic to the field values of the additional data record, which logic includes a full-text indexer that is configured to break down texts into words and to output the words as tokens.
- the corresponding text files can be any text.
- the corresponding text files can be measured value files or algorithms for controlling computers and / or technical systems.
- the field values of the additional data record include full texts, the full texts including words and / or one or more numbers formed from letters of one or more alphabets.
- Full-text indexing involves breaking down texts into individual words, with the individual words of a text field then being stored in an index assigned to this field. Full-text indexing is only supported if the corresponding field is configured for the selective storage of a certain data type, e.g. CHAR, VARCHAR or TEXT.
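A full-text indexer in this sense can be sketched in a few lines. The word pattern `\w+` and the lower-casing of words are assumptions of this example, and `field_index` stands for the index assigned to a single text field.

```python
import re
from collections import defaultdict

WORD = re.compile(r"\w+")


def full_text_tokens(text):
    """Break a text into words and return them as lower-cased tokens."""
    return [word.lower() for word in WORD.findall(text)]


def index_text_field(field_index, record_id, text):
    """Store the words of one text field value in the index assigned to that field."""
    for word in full_text_tokens(text):
        field_index[word].add(record_id)


field_index = defaultdict(set)
index_text_field(field_index, "DS1", "Pump A failed")
index_text_field(field_index, "DS2", "pump B ok")
```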
- natural language text in JSON format can be stored in a field.
- generating the tokens includes applying tokenization logic to the field values of the additional data record, which logic includes a generic tokenizer that is configured to recognize data of different data types in the field values and to create tokens of different data types from them.
- tokenization logic includes a generic tokenizer that is configured to recognize data of different data types in the field values and to create tokens of different data types from them.
- Embodiments can have the advantage that effective tokenization can be implemented for different types of data, such as text data, image data, audio data and / or video data.
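How a generic tokenizer might dispatch on data types can be sketched as follows. The type labels (`"number"`, `"word"`, `"binary"` and so on) and the whitespace splitting of strings are assumptions of this example; real image, audio or video data would need dedicated segmentation that is not shown here.

```python
def generic_tokenize(value):
    """Yield (token, data_type) pairs for a field value of arbitrary type."""
    if isinstance(value, bool):  # checked before int, because bool is an int subtype
        yield value, "bool"
    elif isinstance(value, (int, float)):
        yield value, "number"
    elif isinstance(value, (bytes, bytearray)):
        yield bytes(value), "binary"
    elif isinstance(value, str):
        for part in value.split():
            try:
                yield float(part), "number"
            except ValueError:
                yield part, "word"
    elif isinstance(value, (list, tuple)):
        for item in value:
            yield from generic_tokenize(item)
    else:
        yield repr(value), "unknown"
```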
- the method further comprises:
- Embodiments can have the advantage that the index of the source database can also be expanded by supplementary data records from other individual databases.
- the third individual database is, for example, an individual database which in turn receives supplementary data records from the source database. Changes to the indexes can thus be mutually exchanged between individual databases and uniform copies of a cross-database index can be implemented on a distributed database system, more precisely in the individual databases.
- the method further comprises:
- the response comprising at least: an indication of the identified tokens, one or more data records determined by analyzing pointers with which the identified tokens are linked, or one or more references to the specific data records.
- Embodiments can have the advantage that the index can be used for effective searches.
- the searches can be limited to information in the index, for example in how many or in which data records a certain search value occurs.
- an index access authorization is required.
- a search can also be carried out in this way on the data records of the local individual database, i.e. the individual database performing the search, although the data records are stored in their original form.
- a read access authorization is necessary according to the embodiments. In the case of data records which are stored on another individual database, these can be explicitly requested for reading. For this purpose, according to embodiments, read access authorization is again necessary.
- the learning module can search for patterns and / or regularities within the data records using appropriate search queries.
- the index stores all tokens generated from the field values of the data records of a corresponding individual database in such a way that the index contains each token only once.
- Each token contains pointers to one or more of the data records from whose field values it was generated. If an index generated according to the invention is searched for a specific search value and a token stored in the index that is identical to the search value is identified as the result of the search, then this token uses pointers to refer to all data records that were used when the index was created and that contain this token at least once in at least one of their field values.
- the data records which represent a “hit” with regard to the search value can be identified and returned very quickly using the references, without the need for a sequential search across all data records.
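The lookup described above can be sketched with a plain mapping from tokens to pointer sets. The record identifiers and the function name `search` are made up for this illustration.

```python
def search(index, search_value):
    """Return the sorted record pointers for an exact token match, or [] if absent."""
    return sorted(index.get(search_value, ()))


# Each token occurs once in the index and carries pointers to all of its records.
index = {
    "pump": {"DS1", "DS3"},
    "valve": {"DS2"},
}

hits = search(index, "pump")
hit_count = len(hits)  # in how many data records the search value occurs
```

The data records themselves are never scanned; only the pointers stored with the matching token are read.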
- the search value further comprises a class assignment and the identification of the token within the index further requires that the identified token has the same class assignment.
- Embodiments can have the advantage that class assignments and thus meta or context information indexed with the class assignments can be taken into account in the search queries.
- triggers are identified in the index with a flag.
- the search value further includes an assignment to a trigger definition and / or a flag identifying a trigger, and the identification of the token within the index further requires that the identified token is assigned to the same trigger definition and / or has the same flag.
- tokens which are assigned to the trap class are excluded from the search.
- Embodiments can have the advantage that the resulting search results have a high degree of reliability, since unknown data are excluded from the search.
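A class-aware search that skips the trap class might be sketched as below. It assumes the index is keyed by (token, class) pairs and that the trap class is labelled `"trap"`; both are choices of this example. The linear scan keeps the sketch short and is not meant as an efficient implementation.

```python
TRAP_CLASS = "trap"  # hypothetical label for tokens classified as unknown


def search_with_class(index, value, cls=None):
    """Return pointers of tokens equal to `value`, optionally restricted to a class.

    index maps (token, class) pairs to sets of record pointers.
    Tokens assigned to the trap class are never returned.
    """
    hits = set()
    for (token, token_cls), pointers in index.items():
        if token != value or token_cls == TRAP_CLASS:
            continue
        if cls is None or token_cls == cls:
            hits |= pointers
    return sorted(hits)


index = {
    ("pump", "component"): {"DS1"},
    ("pump", "verb"): {"DS2"},
    ("pump", "trap"): {"DS9"},
}
```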
- the method further comprises pre-training the learning module.
- the pre-training includes:
- the predetermined trigger definitions define initial triggers that are used to structure or classify received data.
- the initial triggers are specifically defined, i.e. predetermined trigger definitions. If data is loaded, these initial triggers enable an initial classification according to known classes as well as unknown data, which are assigned to the collection class.
- the pre-training further comprises:
- data, e.g. text data, audio data, image data, video data or N-dimensional data from N sensors
- the triggers are applied to automatically classify the data. This results in a fragmentation of the data into triggers, known classes, i.e. classes defined by the trigger definitions, and unknown data.
- Embodiments can have the advantage that the learning module can be effectively pre-trained in this way on the basis of the predetermined trigger definitions.
- These predetermined trigger definitions can serve as the basis for obtaining further trigger definitions, for example by combining trigger definitions.
- For example, there is an automatic learning phase of the database system or the learning module, which includes a combination of the initial triggers.
- the initially loaded triggers can be combined, as described above, based on the data comprised by the data records, and thus the number of trigger definitions available can be increased.
- token combinations that have already been classified can be identified. The purpose of this is to ensure that identical data that are later loaded into the source database do not have to be reclassified, but are already marked as "known" in the system.
- generating one of the additional tokens comprises using one of the field values of the additional data set in its entirety as the corresponding additional token. It is entirely possible that the index also contains tokens from fields to which no tokenization is applied or whose content simply cannot be divided into individual tokens. According to embodiments, generating one of the additional tokens comprises dividing one of the additional field values of the additional data record into a plurality of subfield values and using one of the subfield values as the corresponding additional token. Embodiments can have the advantage that the granularity of the data used or the tokenization can be adapted independently of the granularity of the fields.
- the index stores all tokens generated from the field values of the stored data records in such a way that the index contains each token exactly once for each of the token assignments of the corresponding token.
- the tokens, the class assignments and the assignment to the trigger definitions can be stored in the form of relations or equivalent structures.
- a relation is a set of tuples.
- a tuple is a set of attribute values.
- An attribute denotes a data type or a property assigned to one or more data.
- the number of attributes determines the degree, the number of tuples the cardinality of a relation.
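As a small worked example of these terms, a relation over tokens, class assignments and trigger assignments can be written as a set of tuples. The attribute names and values are invented for illustration.

```python
# Attributes of the relation (its column names).
relation_attributes = ("token", "class", "trigger")

# The relation itself: a set of tuples, one attribute value per attribute.
relation = {
    ("2020-03-01", "date", "T1"),
    ("42.5", "number", "T2"),
    ("EUR", "currency", "T3"),
}

degree = len(relation_attributes)  # number of attributes
cardinality = len(relation)        # number of tuples
```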
- the document-based data model used by the multi-model database management system to store the data sets is a NoSQL data model.
- the DBMS is a NoSQL DBMS. This can be advantageous because NoSQL DBMSs in particular often have a more flexible structure than classic SQL-based DBMSs. Because of the flexibility of their structure, NoSQL DBMSs are particularly suitable for managing and storing data records from which an index can be created in accordance with embodiments of the invention.
- the index has the structure of a tree, in particular a B+ tree.
- Embodiments can have the advantage that a tree structure, in particular the structure of a B+ tree, enables a particularly efficient and fast search for the tokens stored in the index.
- a B+ tree is a data and / or index structure that is an extension of a B-tree.
- the actual data elements are only stored in the leaf nodes, while the inner nodes only contain keys.
- several of the data records stored in a document-oriented data model each comprise a different number of fields.
- Embodiments can have the advantage that data sets of different sizes and structures or granularity can be processed.
- the fields each have a common, generic data format.
- Embodiments can have the advantage that, since a large number of different data types can be stored in a specific field, a user or an application program that wants to save data records in the source database does not have to worry about the consistency and matching of data types. A high degree of flexibility with regard to the structure and the scope of the data records that can be managed and stored by the multi-model database management system can therefore be offered.
- the learning module or the machine learning implemented by it is configured for data extraction, consistency checking, image recognition, speech recognition, voice control, device monitoring and / or autonomous device control.
- This can, for example, already consist in the classification of the tokens, with tokens assigned to the collection class as unknown data being viewed, for example, as an indication of a potential malfunction. For example, this can be based on the index with the tokens and their meta or context information, which serves as the basis for an additional machine learning algorithm applied to it.
- the collection class is emptied by adding additional trigger definitions so that meta or context information is provided for all tokens of the database system.
- a data extraction can comprise, for example, a recognition and extraction of a pattern in a text, image, audio or video file.
- This pattern can, for example, be defined by a trigger definition or it can be recorded in the classified data.
- a corresponding pattern can be, for example, a predetermined event recorded in the form of sensor values, for example a person in an effective area of a device.
- a consistency check can include, for example, a consistency check in a text, image, audio or video file. In this case, it is checked, for example, whether the corresponding data include unknown and thus inconsistent data, whether they include data that differ greatly from the rest of the data or whether they include data that are explicitly predefined as inconsistent.
- a corresponding consistency check can be used, for example, to check for errors in control algorithms of devices, to detect malfunctions using measurement data from a function of a device, or to detect errors in text files, for example in the form of a spell check.
- Image recognition can be used to recognize objects, events or features in image or video files. For example, context information on what is visually represented is recorded and / or displayed. This can include, for example, a visual representation of information, that is, the addition of images or videos with computer-generated additional information or virtual objects by means of fading in / overlaying. Such a method is generally referred to as augmented reality. Speech recognition can be used to recognize speech in audio files or video files, for example for voice control or for converting speech into text form.
- Pattern recognition in text, image, audio or video files can be used for device monitoring. In particular, occurring or impending malfunctions can be recognized in this way. This can serve safety and enables predictive maintenance of the corresponding device, since potential problems can be identified at an early stage.
- a corresponding text file is, for example, a data record with measured sensor values.
- an autonomous device control can also be implemented, for example an autonomous control of vehicles, robots or industrial plants.
- a “device” is generally understood here to mean a technical device with sensors for acquiring status data of the device and a device computer system for logging the acquired status data.
- the device can also consist of the corresponding computer system with sensors.
- the received data sets are data sets recorded by a device computer system using the sensors.
- A device comprises, for example, a vehicle or a plant, such as a production plant, a processing plant, a conveyor system, an energy generation system, a heat generation system, a control system, a monitoring system, etc.
- a “vehicle” is understood here to mean a mobile means of transport. Such a means of transport can be used, for example, to transport goods (freight transport), tools (machines or auxiliary equipment) or people (passenger transport). Vehicles in particular also include motorized means of transport.
- a vehicle can be, for example, a land vehicle, a watercraft and / or an aircraft.
- a land vehicle can be, for example: an automobile, such as a passenger car, omnibus or a lorry, a motor-driven two-wheeler such as a motorcycle, moped, scooter or motor bike, an agricultural tractor, forklift truck, golf cart, truck crane.
- a land vehicle can also be a rail-bound vehicle.
- watercraft can be: a ship or boat.
- an aircraft can be, for example: an airplane or a helicopter.
- a vehicle is also understood to mean, in particular, a motor vehicle.
- the device comprises at least one sensor for detecting status data of the device.
- the state data of the device are received by the device computer system from the at least one sensor.
- the device comprises a plurality of sensors for acquiring status data of the device.
- Embodiments can have the advantage that the device's own sensor system can be used to detect the state of the device.
- the state of the device can be described, for example, through information on parameters of the current performance of the device, such as the mileage of a vehicle, consumption values, performance values, error messages, results of predefined test protocols and / or identifiers of components of the device.
- Parameters of the current performance of a vehicle can be, for example, engine speed, speed, fuel consumption, exhaust gas values, and transmission gear.
- a “sensor” is understood here to be an element for recording measurement data.
- Measurement data are data which qualitatively or quantitatively reproduce physical or chemical properties of a measurement object, such as amount of heat, temperature, humidity, pressure, flow rate, sound field quantities, brightness, acceleration, pH value, ionic strength, electrochemical potential, and / or its material properties. Measurement data are recorded using physical or chemical effects and converted into an electronic signal that can be processed further. Furthermore, measurement data can reproduce states and / or changes in the state of electronic devices due to external influences and / or as a result of use by a user.
- Sensors for recording status data in a vehicle can include, for example: crankshaft sensor, camshaft sensor, air mass meter, air temperature sensor, cooling water temperature sensor, throttle valve sensor, knock sensor, transmission sensor, distance sensor, transmission sensor, level sensor, brake wear sensor, axle load sensor, steering angle sensor. These sensors record and monitor the driving behavior of the vehicle. Malfunctions can be recognized and identified from deviations from target values and / or the occurrence of specific patterns. In some cases, specific causes of errors, such as failed vehicle components, can also be identified. Sensors can also query the identifiers of electronic components that are built into the vehicle in order to check their identity.
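Recognizing a malfunction from deviations from target values can be sketched as a simple tolerance check. The sensor names, target values and the relative tolerance of 10 % are invented for this example and are not taken from the description.

```python
def detect_deviations(readings, targets, tolerance=0.1):
    """Return the readings that deviate from their target by more than `tolerance`.

    tolerance is relative to the target value; readings without a target are skipped.
    """
    flagged = {}
    for name, value in readings.items():
        target = targets.get(name)
        if target is None:
            continue
        if abs(value - target) > tolerance * abs(target):
            flagged[name] = value
    return flagged


targets = {"coolant_temp_c": 90.0, "oil_pressure_bar": 3.0}
readings = {"coolant_temp_c": 118.0, "oil_pressure_bar": 3.1, "steering_angle_deg": 2.0}
flagged = detect_deviations(readings, targets)
```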
- Embodiments include a source computer system for implementing a cross-database index on a distributed database system, which comprises a plurality of computer systems comprising the source computer system, each with an independent individual database, the computer systems with the individual databases being communicatively connected to one another via a network.
- the computer systems also each include one or more processors, one or more data storage media on which the respective individual database is made available, a communication interface for communication via the network, and a program logic.
- the individual databases are each managed by a multi-model database management system, the individual databases each comprising a plurality of database-specific data records that are stored in a document-oriented first data model of the respective individual database, the stored data records each including one or more field values, the individual field values of the stored data records each being stored in a field.
- the individual databases also each comprise a searchable first index which is stored in a second data model of the respective individual database, the index of the respective individual database comprising a plurality of tokens generated from the field values of the data records stored in the document-oriented data model of the corresponding individual database, each of the tokens being linked in the index with one or more pointers to one or more of the data records stored in the document-oriented data model of the corresponding individual database, from whose field values the corresponding token was generated.
- the program logic of the source computer system is configured to perform a method of implementing a cross-database index, the method comprising:
- the source computer system is configured to carry out one or more of the aforementioned embodiments of the method for implementing a cross-database index on the distributed database system.
- Embodiments further include a distributed database system for implementing a cross-database index on the distributed database system, which includes a plurality of computer systems, each with an independent individual database, the computer systems with the individual databases being communicatively connected to one another via a network.
- the computer systems also each include one or more processors, one or more data storage media on which the respective individual database is made available, a communication interface for communication via the network, and program logic.
- the individual databases are each managed by a multi-model database management system, the individual databases each comprising a plurality of database-specific data records which are stored in a document-oriented first data model of the respective individual database, the stored data records each comprising one or more field values, the individual field values of the stored data records each being stored in a field.
- the individual databases also each comprise a searchable first index which is stored in a second data model of the respective individual database, the index of the respective individual database comprising a plurality of tokens generated from the field values of the data records stored in the document-oriented data model of the corresponding individual database, each of the tokens being linked in the index with one or more pointers to one or more of the data records stored in the document-oriented data model of the corresponding individual database, from whose field values the corresponding token was generated.
- the program logic is in each case configured to execute a method for implementing a cross-database index, the computer system of the plurality of computer systems that executes the program logic acting as the source computer system, the method comprising:
- the distributed database system is configured to carry out one or more of the aforementioned embodiments of the method for implementing a cross-database index on the distributed database system.
- Figure 1 is a schematic block diagram of an embodiment of a
- Figures 2 are schematic diagrams of an embodiment of an exemplary distributed database system
- FIG. 3 shows a schematic diagram of an embodiment of an exemplary distributed database system
- FIG. 4 shows a schematic flowchart of an exemplary method
- FIG. 5 shows a schematic flowchart of an exemplary method
- FIG. 6 shows a schematic flowchart of an exemplary method
- FIG. 7 shows a schematic flowchart of an exemplary method
- FIG. 8 shows a schematic block diagram of embodiments of exemplary computer systems
- FIG. 9 shows a schematic block diagram of an exemplary assignment of rights
- FIG. 10 shows a schematic block diagram of an exemplary implementation of a system for granting rights
- Figure 11 is a schematic block diagram of an embodiment of a
- FIG. 12 shows a schematic block diagram of an exemplary data processing by the multi-model database management system
- FIG. 13 shows a schematic block diagram of an exemplary data processing by the multi-model database management system
- FIG. 14 shows a schematic block diagram of embodiments of exemplary computer systems
- FIG. 15 shows a flow diagram of an embodiment of an exemplary method
- FIG. 16 shows a flow diagram of an embodiment of an exemplary method
- FIG. 17 shows a flow diagram of an embodiment of an exemplary method
- FIG. 18 shows a flow diagram of an embodiment of an exemplary method.
- FIG. 1 shows a block diagram of an embodiment of an exemplary computer system 100, which comprises an individual database 104 for implementing a cross-database index on a distributed database system 170.
- the distributed database system 170 comprises a plurality of individual databases 104, which are each implemented on a computer system 100.
- the computer system 100 further includes a multi-model database management system (MM-DBMS) 118, which manages the storage of data, possibly in structured form, in the at least one individual database 104 and controls all read and write accesses to the individual database 104.
- the MM-DBMS 118 supports at least two data models 106, 110, in which data are stored in the individual database 104.
- the database model defines the form in which the relevant data is organized, saved and processed.
- the first data model 106 is a document-based data model in which a plurality of data records DS1, DS2, DS3 are stored. Each data record DS1, DS2, DS3 is saved in a document or a data container. No specific structure is specified for the data records DS1, DS2, DS3 themselves when they are stored by the document-based data model 106.
- the data records DS1, DS2, DS3 can therefore be stored with the internal structure with which the data records DS1, DS2, DS3 are received by the individual database 104.
- the data records DS1, DS2, DS3 stored in the document-based data model 106 are raw data.
- the data records DS1, DS2, DS3 can include, for example, text data, image data, audio data and / or video data.
- the data records DS1, DS2, DS3 each include at least one field F1, ..., F8, with field values.
- the data records DS1, DS2, DS3 can already have an internal structure with a plurality of fields F1,..., F8 when they are stored.
- the corresponding data records DS1, DS2, DS3 then each comprise a plurality of fields F1, ..., F8.
- If the data records DS1, DS2, DS3 do not have any fields when they are received, they each include, for example, exactly one field in stored form which comprises the entire data volume of the corresponding data record DS1, DS2, DS3.
- the fields F1, ..., F8 each contain one or more field values.
- Each of the field values of a data record DS1, DS2, DS3 is stored in a corresponding field, a type of data container.
- Each field F1, ..., F8 can be assigned to a field type.
- the fields F1, ..., F8 can be assigned to different field types or all to the same field type.
- the composition of the field values of the individual data records DS1, DS2, DS3 can differ with regard to their field types.
- each document comprises a field for each mandatory field type and optionally comprises one or more further fields for optional field types.
- the data of the data records are then stored in fields of the field type intended for them, i.e. for example text data in one or more text fields, image data in one or more image fields, audio data in one or more audio fields and / or video data in one or more video fields.
- the computer system 100 includes a tokenizer 120 for generating tokens 109.
- the MM-DBMS 118 can also include the tokenizer 122.
- the computer system 100 for example the MM-DBMS 118, has a built-in program logic or data processing function which is configured to generate an index 112.
- the corresponding index 112 is provided in a further data model 110 in which the complete data of the data records DS1, DS2, DS3 or a data volume derived from these are stored in a restructured, redundancy-free form.
- the tokenizer 122 is accessed, which is configured to tokenize the field values of the fields F1, ..., F8 of the data records DS1, DS2, DS3 stored in the document-based data model 106.
- the resulting tokens 109 can also be identical to a field value of a field or a data record if no further breakdown into tokens 109 is possible or useful.
- the tokenization can also take place in stages, so that an ever finer breakdown takes place.
- the resulting index 112 can therefore include tokens 109 which are composed of other tokens 109.
- the program logic or data processing function can be identical on all computer systems 100 or for all individual databases 104 of the distributed database system 170.
- the program logic or data processing function can also differ between computer systems 100 or individual databases 104 of the distributed database system 170, for example depending on the type or the content of the data to be stored by the respective individual database.
- All or at least most of the field values of all data records DS1, DS2, DS3 of the individual database 104 are preferably tokenized, so that an extensive amount of tokens 109 is created.
- tokens 109 can include a mixture of numbers, letters, images or image segments, audio files or audio elements or other data structures, in particular sensor data from one or more sensors.
- Each of the generated tokens 109 is stored in the index 112 linked to a pointer, the pointer pointing to the data record or the field from which the token 109 originates.
- a non-redundant, unique token set is formed from the set of tokens 109, in which each of the tokens 109 occurs at most once. All tokens 109 of the non-redundant token set are preferably stored in the index 112 in such a way that the tokens 109 are sorted according to a sorting criterion and are stored in sorted form in the index structure.
- the sorting can for example take place on the basis of the alphabet for alphanumeric data or other sorting criteria adapted to the data.
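Forming the non-redundant, sorted token set and searching it can be sketched with a sorted list and binary search. This is a simplified stand-in chosen for brevity, not a tree structure; `build_sorted_index` and `lookup` are hypothetical names.

```python
from bisect import bisect_left


def build_sorted_index(tokens_with_pointers):
    """Collapse (token, pointer) pairs into unique, sorted tokens with pointer sets."""
    merged = {}
    for token, pointer in tokens_with_pointers:
        merged.setdefault(token, set()).add(pointer)
    keys = sorted(merged)
    return keys, [merged[key] for key in keys]


def lookup(keys, pointer_sets, token):
    """Binary search in the sorted token list; return the pointer set or an empty set."""
    i = bisect_left(keys, token)
    if i < len(keys) and keys[i] == token:
        return pointer_sets[i]
    return set()


keys, pointer_sets = build_sorted_index([("b", "DS1"), ("a", "DS2"), ("b", "DS3")])
```

Because the tokens are kept in sorted order, a lookup needs only a logarithmic number of comparisons instead of a scan over all tokens.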
- Since the tokens 109 are preferably stored in the index 112 in sorted form, and furthermore preferably in a tree structure, it is possible very quickly to identify a specific token 109 within the index 112 and then to identify the references
- a supplementary data record 130 is created, which includes the additions made in the index 112 of the source database 104 and is used to supplement further indices of further individual databases of the distributed database system 170 as receiving databases.
- the supplementary data record 130 created in this way is sent to the receiving databases of the distributed database system 170 via a network using a communication interface 126 of the computer system 100.
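One conceivable wire format for such a supplementary data record is sketched below. The JSON layout, the field names `source` and `additions`, and the restriction to text tokens are assumptions of this example; the description does not specify a serialization.

```python
import json


def encode_supplement(source_db, additions):
    """Package index additions as a JSON message.

    additions: iterable of (token, pointer) pairs newly added to the source index.
    """
    grouped = {}
    for token, pointer in additions:
        grouped.setdefault(token, []).append(pointer)
    return json.dumps({"source": source_db, "additions": grouped}, sort_keys=True)


def decode_supplement(message):
    """Unpack a message into the source name and a token -> pointer-set mapping."""
    payload = json.loads(message)
    additions = {token: set(ptrs) for token, ptrs in payload["additions"].items()}
    return payload["source"], additions


message = encode_supplement("DB1", [("pump", "DB1/DS2"), ("valve", "DB1/DS2")])
source, additions = decode_supplement(message)
```

Only the additions travel over the network, so a receiving database can update its index without ever receiving the underlying data records.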
- FIG. 2A shows a diagram of an embodiment of an exemplary distributed database system 170.
- the distributed database system 170 comprises a plurality of nodes K 1 , ..., K 6 . These nodes K 1 , ..., K 6 are formed by computer systems 100, 200, which each comprise an individual database of the distributed database system. Each of the nodes K 1 , ..., K 6 individually aggregates data D 1 , ..., D 6 and uses them to create an index I 1 , ..., I 6 .
- the corresponding node K 1 , ..., K 6 creates a supplementary data record E (I 1 ), which includes the additions to the index I 1 , and sends this to the further nodes K 2 , ..., K 6 of the distributed database system 170, for example node K 2 .
- the further nodes K 2 , ..., K 6 of the distributed database system 170 create supplementary data records E (I 2 ) when their index is supplemented, for example I 2 , and send them to the further nodes, such as node K 2 .
- FIG. 3 shows a plurality of nodes K 1 ,..., K N of a distributed database system 170, each of which includes an individual database 1,..., N.
- each of the individual databases 1, ..., N comprises a complete cross-database index of the distributed database system 170, which includes the information of all individual indices I 1 , ..., I N.
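That every node ends up with the same complete index can be illustrated with a toy simulation. It assumes pointers are qualified with the node name (for example `K1/DS1`) so that they stay unambiguous across databases, and it models the exchange as a single in-memory broadcast rather than network communication.

```python
def broadcast(nodes):
    """Give every node a copy of its own index merged with all other nodes' indices.

    nodes: mapping node name -> local index (token -> set of pointers).
    """
    combined = {
        name: {token: set(ptrs) for token, ptrs in index.items()}
        for name, index in nodes.items()
    }
    for sender, index in nodes.items():
        for receiver in combined:
            if receiver == sender:
                continue
            for token, pointers in index.items():
                combined[receiver].setdefault(token, set()).update(pointers)
    return combined


nodes = {
    "K1": {"pump": {"K1/DS1"}},
    "K2": {"pump": {"K2/DS1"}, "valve": {"K2/DS2"}},
    "K3": {},
}
result = broadcast(nodes)
```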
- FIG. 4 shows a flow chart for adding the indexes.
- a first node K 1 of the distributed database 170 is provided, which includes a data processing function F 1 .
- the node K 1 receives data D 1 .
- An index I 1 is generated from this data D 1 in step c) using the data processing function F 1 or an existing index I 1 is supplemented.
- the resulting index I 1 or a supplementary data record resulting from the addition of the existing index I 1 is sent from the node K 1 to at least one further node K 2 of the distributed database 170.
- the further node K 2 also includes data D 2 , from which an index I 2 is generated using the data processing function F 2 .
- the index I 2 is supplemented by the received index information I 1 of the first node K 1 , so that the resulting index of the node K 1 is a combined index l (l) which contains the index information of both indices I 1 , I 2 of both nodes K 1 , K 2 combined with each other.
- FIG. 5 shows a flow chart of an exemplary method for implementing a cross-database index on a distributed database system.
- an additional data record is received by a first individual database of the distributed database system, which acts as the source database.
- This additional data record, which for example includes one or more additional field values, is used to supplement the source database.
- the additional data record is stored by a multi-model database management system of the source database in a document-oriented first data model of the source database.
- a data processing function of the source database is applied to the additional data record, which includes at least indexing of the additional data record for storage in a second data model of the source database.
- one or more additional tokens are generated from the additional data set.
- the index of the source database is supplemented by the multi-model database management system using the additional tokens and a pointer to the additional data record stored in the document-oriented data model of the source database. For example, the additional tokens are compared with the index of the source database. If one of the additional tokens is not included in the index of the source database, the corresponding additional token is added to the index of the source database and linked to the pointer to the additional data record stored in the document-oriented data model of the source database. If one of the additional tokens is already included in the index of the source database, the corresponding additional token is linked in the index of the source database with the pointer to the additional data record stored in the document-oriented data model of the source database.
- a supplementary data record resulting from the application of the data processing function is created.
- This supplementary data record comprises the additions made to the index of the source database and is used to supplement at least one second index of at least one second individual database of the distributed database system as a receive database.
- the supplementary data set is sent via a network to the second individual database for integration into the second index of the receive database.
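The source-side steps of FIG. 5 can be sketched as follows. This is a minimal illustration rather than the claimed implementation: the whitespace tokenizer, the class name `SourceDatabase`, the use of record identifiers as pointers and the dictionary layout of the supplementary data record are all assumptions made for the example.

```python
from collections import defaultdict


def tokenize(field_values):
    """Hypothetical tokenizer: split field values on whitespace and lowercase them."""
    return {tok.lower() for value in field_values for tok in value.split()}


class SourceDatabase:
    """Toy source database with a document store and a token index."""

    def __init__(self):
        self.documents = {}            # first data model: record id -> field values
        self.index = defaultdict(set)  # second data model: token -> set of pointers

    def add_record(self, record_id, field_values):
        """Store the record, supplement the index, return the supplementary record."""
        self.documents[record_id] = list(field_values)
        additions = {}
        for token in tokenize(field_values):
            # A new token is added to the index; a known token only gains a pointer.
            if record_id not in self.index[token]:
                self.index[token].add(record_id)
                additions.setdefault(token, []).append(record_id)
        # Only the additions are returned, not the whole index.
        return additions


db = SourceDatabase()
first = db.add_record("doc1", ["Alice Berlin"])
second = db.add_record("doc2", ["Bob Berlin"])
```

In this sketch the returned dictionary plays the role of the supplementary data record: it carries only the tokens and pointers that were added, so it is what would be sent over the network to the receive databases.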
- FIG. 6 shows a flowchart of an exemplary method for integrating a supplementary data record into the index of a receive database.
- the receive database receives the supplementary data record from a source database.
- the supplementary data record includes additions made to the index of the source database.
- the supplementary data set received is integrated into the index of the receive database.
- the integration includes adding to the index of the receive database. For example, tokens of the supplementary data record are compared with the index of the receive database.
- If one of the tokens of the supplementary data record is not included in the index of the receive database, the corresponding token is added to the index of the receive database and linked in the index with a pointer to the data record stored in a document-oriented data model of the source database from which the corresponding token was generated. If one of the tokens of the supplementary data record is already included in the index of the receive database, the corresponding token in the index of the receive database is linked to the pointer to the data record stored in the document-oriented data model of the source database from which the corresponding token was generated.
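The integration on the receive side can be illustrated with a short sketch. It assumes, purely for the example, that an index is a dictionary from token to a set of pointers and that a pointer is a `(source_database_id, record_id)` tuple; the function name `integrate` is hypothetical.

```python
def integrate(receive_index, supplementary_record):
    """Merge a supplementary data record into the index of a receive database.

    receive_index: dict mapping token -> set of pointers
    supplementary_record: dict mapping token -> iterable of pointers
    Returns the list of tokens that were new to the receive index.
    """
    new_tokens = []
    for token, pointers in supplementary_record.items():
        if token not in receive_index:
            # Unknown token: add it to the receive index first.
            receive_index[token] = set()
            new_tokens.append(token)
        # Known or new, the token is linked to the pointers into the source database.
        receive_index[token].update(pointers)
    return new_tokens


index_k2 = {"berlin": {("K2", "d7")}}
added = integrate(index_k2, {"berlin": [("K1", "d1")], "alice": [("K1", "d1")]})
```

Because pointers are kept in sets, applying the same supplementary data record twice leaves the index unchanged, which is convenient if a record is delivered more than once.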
- FIG. 7 shows a flow diagram of an embodiment of an exemplary method for carrying out a search on one of the individual databases.
- a search query is received that includes a search value.
- the index is searched for the search value and in block 504 a token is identified within the index which is identical to the search value and / or which falls within the search range defined by the search value.
- pointers are analyzed with which the identified token or tokens are associated. In this way, one or more of the data records can be determined which contain one or more field values from which the indexed token was generated.
- a response to the search query is returned.
- This response includes, for example: an indication of the identified tokens, one or more specific data records or one or more references to the specific data records by analyzing pointers with which the identified tokens are linked. If, for example, a search is only made for which tokens meet a certain criterion determined by the search value, it is sufficient to return the tokens found without further analysis of pointers or data records. If, according to the search, the data records that comprise the tokens found are also to be identified, the pointers are analyzed, but access to the underlying data records is not necessary. Alternatively, the underlying data records can also be queried according to the search.
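The three response levels described for FIG. 7 (tokens only, pointers without record access, full records) can be sketched as below. The function `search`, its `level` parameter and the use of a predicate for the search value are assumptions introduced for illustration.

```python
def search(index, documents, predicate, level="tokens"):
    """Search a token index and answer at one of three levels.

    index: dict mapping token -> set of pointers (record ids)
    documents: dict mapping record id -> stored data record
    predicate: function deciding whether a token matches the search value
    """
    matches = sorted(tok for tok in index if predicate(tok))
    if level == "tokens":
        return matches  # no pointer analysis needed
    pointers = sorted({p for tok in matches for p in index[tok]})
    if level == "pointers":
        return pointers  # records are identified but not read
    if level == "records":
        return [documents[p] for p in pointers]  # underlying records are queried
    raise ValueError(f"unknown level: {level}")


index = {"berlin": {"d1", "d2"}, "bern": {"d3"}, "alice": {"d1"}}
documents = {"d1": "Alice Berlin", "d2": "Bob Berlin", "d3": "Carol Bern"}
exact = search(index, documents, lambda t: t == "berlin")
by_range = search(index, documents, lambda t: t.startswith("ber"), "pointers")
records = search(index, documents, lambda t: t == "alice", "records")
```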
- FIG. 8 shows a schematic block diagram of an embodiment of an exemplary source computer system 100 and an exemplary receiving computer system 200 of a distributed database system 170.
- Both computer systems 100, 200 each include at least one processor 114, 214 which executes program instructions 116, 216 .
- By executing the program instructions 116, 216, the method described above for implementing a cross-database index on a distributed database system 170 is implemented, for example.
- the processors 114, 214 each execute a multi-model database management system 118, 218 and a tokenizer 122.
- the computer systems 100, 200 each include an individual database 104, 204 in a memory 102, 202, which is managed by the respective multi-model database management system 118, 218.
- the databases 104, 204 each include a first data model 106, 206, for example a document-oriented data model, in which data sets 108, 208 are stored.
- the databases 104, 204 each include a second data model 110, 210 with an index 112, 212 of all the data stored in the data records 108, 208.
- In order to be able to synchronize the indexes 112, 212 for different data records 108, 208 and thus to implement a cross-database index on the distributed database system 170, a supplementary data record 130 is generated, for example, on the source computer system 100, which supplementary data record includes additions to the index 112.
- both computer systems 100, 200 each include a communication interface 126, 226 through which they can communicate with one another via a network 180.
- the source computer system 100 sends the supplementary data record 130 via the network 180 to the receiving computer system 200, which uses the received supplementary data record 130 to synchronize its index 212 with the index 112.
- FIG. 9 shows a block diagram of the allocation of rights by means of the transmission of access rights via a chain of entities, such as individual databases or their users.
- a data record 108 with useful data 160 is received by an individual database.
- access certificates 162, 164, 166 which are assigned to a first entity, are automatically stored in the background in corresponding fields of the data record.
- the first entity is, for example, the creator of the data record, such as the corresponding individual database or a user of the same.
- the access certificates 162, 164, 166 are configured for different types of access to the data record.
- the access certificate 162 is for read access.
- the access certificate 166 is for access to tokens of an index which are generated from the corresponding data record.
- an index access certificate 166, which enables index access in relation to the data record 108, is assigned to the further entity in an ID database 172. This takes place, for example, in that a link between this certificate 166 and a user certificate 168 of the further entity is stored in the ID database 172, for example within the same access authorization chain object.
- access rights with regard to the data record 108 can also be assigned to a third entity.
- the index access certificate 166 is also assigned to this further entity. This is done, for example, by storing a link between this certificate 166 and a user certificate 174 of the third entity in the ID database 172, for example within the same access authorization chain object.
- the user certificate 174 can either be attached to the user certificate 168 or directly to the index access certificate 166.
- the chain of user certificates documented in the access authorization chain objects documents the transfer of access rights over a sequence of several users.
- the sequence can consist of a simple stringing together of user certificates, whereby for example the position of the user certificates represents the time series of transmissions within the chain objects.
- the chain of user certificates within an authorization chain object can be generated in that the ID database uses the private keys assigned to the individual user certificates so that the last user certificate in the chain signs the newly added user certificate, so that a certificate chain check is also possible within the chain of user certificates of the individual authorization chain objects.
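The chained signing of certificates can be sketched as follows. Note the substitution: the text speaks of private keys, which implies asymmetric signatures, whereas this example uses HMAC-SHA256 from the Python standard library as a stand-in so that it stays self-contained. The certificate names, the key table and the entry layout are hypothetical.

```python
import hashlib
import hmac


def sign(key: bytes, message: bytes) -> str:
    """Stand-in for a signature: HMAC-SHA256 instead of an asymmetric scheme."""
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def append_certificate(chain, keys, new_cert):
    """Append new_cert; the last certificate in the chain signs the new one."""
    signer = chain[-1]["cert"]
    chain.append({
        "cert": new_cert,
        "signed_by": signer,
        "signature": sign(keys[signer], new_cert.encode()),
    })


def verify_chain(chain, keys):
    """Check that every entry was signed by its predecessor in the chain."""
    for prev, entry in zip(chain, chain[1:]):
        expected = sign(keys[prev["cert"]], entry["cert"].encode())
        if entry["signed_by"] != prev["cert"]:
            return False
        if not hmac.compare_digest(expected, entry["signature"]):
            return False
    return True


# Keys held by the ID database (assumed values for the example).
keys = {"cert166": b"key-0", "user168": b"key-1", "user174": b"key-2"}
chain = [{"cert": "cert166", "signed_by": None, "signature": None}]
append_certificate(chain, keys, "user168")
append_certificate(chain, keys, "user174")
chain_is_valid = verify_chain(chain, keys)
```

The position of an entry in the list reflects the order of the transfers, so the list doubles as the time series of the chain.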
- FIG. 10 shows a block diagram of the distributed database system 170.
- This includes, for example, a source computer system 100 and a receiving computer system 200.
- the source computer system 100 sends a supplementary data set 130 to the receiving computer system 200 via a network 180 in order to supplement an index comprised by the receiving database 204 of the receiving computer system 200, so that this index is synchronized with an index of the source database 104 of the source computer system 100.
- To access the index of the source database 104, the source computer system 100 or a user of the same must provide evidence of access rights. These are checked, for example, by a comparison with an access authorization chain object stored in the source ID database 172. If the access rights exist, the index can be accessed, for example, and a supplementary data record 130 can be created. So that the receiving computer system 200 is also authorized to supplement the index of the receiving database 204, a corresponding access authorization chain object is generated for this or a user of the same and is transmitted to the source ID database 174. The receiving computer system 200 can thus also prove its access rights. According to embodiments, authorization of the receiving computer system 200 or the receiving database 204 is a necessary prerequisite for sending the supplementary data record 130 with index information to the receiving database 204. According to alternative embodiments, a central ID database for several or all computer systems 100, 200 of the distributed database system 170 can be provided.
- FIG. 11 shows a block diagram of an expanded embodiment of the exemplary computer system 100 from FIG. 1, which is configured for machine learning.
- the computer system 100 from FIG. 11 additionally comprises a learning module 120 for processing the data stored in the individual database 104.
- the learning module 120 comprises, for example, the tokenizer 122 for generating tokens 109 and also trigger definitions 123, which define triggers for a classification of tokens 109, and / or a classifier 124, which classifies the tokens 109 using the trigger definitions 123.
- the learning module 120 further comprises a statistical model 125.
- the statistical model 125 can be configured to record trigger combinations and create combined trigger definitions, create additional trigger definitions and / or correct trigger definitions.
- the MM-DBMS 118 can also include the tokenizer 122 and / or access a tokenizer 122 provided by the learning module 120.
- the trigger definitions 123 can also be stored in the individual database 104.
- the MM-DBMS 118 and / or the learning module 120 have a built-in program logic that is configured to generate an index 112.
- tokens 109 in the index 112, which are included as triggers in one of the trigger definitions 123, are each assigned to the corresponding trigger definition 123. Furthermore, tokens 109 in index 112, which are comprised by one of the data records DS1, DS2, DS3 in a combination with one or more of the identified triggers according to one of the trigger definitions 123, are each assigned to one or more classes. The corresponding class assignments provide meta or context information for the corresponding tokens 109. Finally, the remaining tokens 109 in the index 112, which were neither identified as triggers using the trigger definitions 123 nor assigned to a class, are assigned to a collection class in order to identify them as unknown data. An assignment to the collection class excludes an assignment to one of the trigger definitions 123 as well as an assignment to one of the classes in accordance with the trigger definitions 123. The assignments described above take place, for example, using the classifier 124 of the learning module 120.
- a non-redundant, unique token set is formed from the set of tokens 109, in which each of the tokens 109 occurs only once. Even if a token 109 with a certain value and a certain class assignment occurs several times in the individual database 104 or in the data model 106, it is only stored once with this class assignment in the non-redundant token set and in the index 112, for example.
- FIG. 12 shows a schematic block diagram of an exemplary data processing by the multi-model database management system and the learning module.
- This trigger definition 123 defines two triggers, i.e. a first trigger “lives in” and a second trigger “in”.
- the trigger definition also defines that a token immediately preceding the first trigger is a surname, while a token immediately preceding the surname is a first name.
- the trigger definition also defines that a token arranged between the two triggers is a street and that a token immediately following the second trigger is a city.
- Two documents 108 are stored in a document-based data model 106 of a database.
- Each document 108 comprises a data record DS1, DS2.
- the data records DS1, DS2 are each a text file.
- the first data record DS1 includes, for example, the sentence: "Sample_first_name_1 Sample_surname_1 lives in Sample_street_1 in Sample_city_1". This sentence is broken down into tokens 109 by means of a tokenizer: "Sample_first_name_1", "Sample_surname_1", "lives in", "Sample_street_1", "in", "Sample_city_1".
- the two tokens "lives in" and "in" are identified as triggers according to the trigger definition 123.
- the remaining tokens 109 are each assigned to the classes 111 defined by the trigger definition.
- the tokens identified as triggers, like the tokens classified using these triggers, are stored in an index in a second data model 110.
- the triggers are each assigned to the trigger definition 123 in the form of a trigger assignment 117.
- the remaining tokens 109 are each stored in the form of a class assignment 113, assigned to one of the classes defined by the trigger definition 123.
- all triggers and classified tokens in the second data model 110 are linked with a pointer 115 to their storage location in the first data model, i.e. DS1.
- in the second data record DS2, the two tokens "lives in" and "in" are likewise identified as triggers according to the trigger definition 123. Since these two triggers of the trigger definition 123 are already included in the index, they are not stored again in the second data model 110. Only a pointer to the second data record DS2 is added. Using the identified triggers and the trigger definition 123, the remaining tokens 109 of the data record DS2 are each assigned to the classes 111 defined by the trigger definition.
- the classified tokens 109 of the data record DS2 are each stored in the form of a class assignment 113 assigned to one of the classes defined by the trigger definition 123 and linked with a pointer 115 to their storage location in the first data model, i.e. DS2.
- All tokens of the second data record DS2 are therefore also stored in a redundancy-free form, each with their class assignments in the second data model 110 linked with a pointer to their storage location in the first data model.
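The trigger definition of FIG. 12 and the redundancy-free storage of triggers can be illustrated with a small sketch. The token spellings, the function names and the choice of `(token, assignment)` tuples as index keys are assumptions for the example; the positional rules follow the trigger definition as described above.

```python
FIRST_TRIGGER = "lives in"
SECOND_TRIGGER = "in"


def classify(tokens):
    """Apply the two-trigger definition to one tokenized record."""
    try:
        i = tokens.index(FIRST_TRIGGER)
        j = tokens.index(SECOND_TRIGGER, i + 1)
    except ValueError:
        return {}  # definition does not apply to this record
    classes = {}
    if i >= 1:
        classes[tokens[i - 1]] = "surname"      # directly before the first trigger
    if i >= 2:
        classes[tokens[i - 2]] = "first name"   # directly before the surname
    if j - i == 2:
        classes[tokens[i + 1]] = "street"       # between the two triggers
    if j + 1 < len(tokens):
        classes[tokens[j + 1]] = "city"         # directly after the second trigger
    return classes


def add_to_index(index, record_id, tokens):
    """Store each (token, assignment) once and link it to the record."""
    classes = classify(tokens)
    for token in tokens:
        if token in (FIRST_TRIGGER, SECOND_TRIGGER):
            assignment = "trigger"
        else:
            assignment = classes.get(token, "collection")
        index.setdefault((token, assignment), set()).add(record_id)


index = {}
add_to_index(index, "DS1", ["Sample_first_name_1", "Sample_surname_1", "lives in",
                            "Sample_street_1", "in", "Sample_city_1"])
add_to_index(index, "DS2", ["Sample_first_name_2", "Sample_surname_2", "lives in",
                            "Sample_street_2", "in", "Sample_city_2"])
```

After both records are processed, the two triggers appear once each in the index and carry pointers to both DS1 and DS2, while the classified tokens of each record carry a single pointer.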
- FIG. 13 shows a schematic block diagram of an exemplary data processing by the multi-model database management system and the learning module.
- This trigger definition 123 is used to classify tokens generated from an image file, the image file being broken down into tokens in the form of pixel groups.
- the trigger definition 123 defines two triggers, i.e. a first trigger in the form of a pixel group with the content “+” and a second trigger in the form of a pixel group with the content “x”.
- the trigger definition defines that a pixel group which is arranged within a first radius of N pixels around the first trigger and at the same time within a second radius of N pixels around the second trigger is a token of the class "class".
- a document 108 is stored in a document-based data model 106 of a database.
- This document 108 comprises a data record DS.
- the data record DS is a two-dimensional image file.
- This image file is broken down into tokens by means of a tokenizer, the tokens each being pixel groups 150.
- the image file is broken down, for example, into pixel groups of equal size of Z by Z pixels.
- the tokens include, for example, a first token in the form of a pixel group with the content “x”, a second token in the form of a pixel group with the content “+”, a third token in the form of a pixel group with the content “#” and a fourth token in the form of a pixel group with the content
- the two tokens “+” and “x” are identified as triggers 121 according to trigger definition 123.
- the third token “#” is assigned to the class 111 defined by the trigger definition, since it is arranged in the two-dimensional image file within a first radius 152 of N pixels around the first trigger “+” and at the same time within a second radius 154 of N pixels around the second trigger “x”. Since the fourth token does not fall under the trigger definition 123, it is assigned to the collection class as unknown data.
- the tokens “+” and “x” identified as triggers 121 are stored in an index in a second data model 110, as are the token “#” classified on the basis of these triggers and the token assigned to the collection class.
- the triggers “+” and “x” are each assigned to the trigger definition 123 in the form of a trigger assignment 117.
- the token “#” is saved in the form of a class assignment 113 assigned to the class defined by the trigger definition 123.
- the fourth token is stored in the form of an assignment 119 assigned to the collection class.
- all triggers and classified tokens in the second data model 110 are linked with a pointer 115 to their storage location in the first data model, i.e. DS.
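The radius rule of FIG. 13 can be sketched as follows. The coordinates, the radius value and the placeholder content "?" for the fourth token are invented for the example (the content of the fourth token is not given above); Euclidean distance between pixel-group centres is likewise an assumption.

```python
import math


def classify_pixel_groups(groups, n, triggers=("+", "x")):
    """Classify pixel-group tokens by their distance to two trigger groups.

    groups: list of (content, (x, y)) with the centre position of each group
    n: radius in pixels around each trigger
    """
    positions = {content: pos for content, pos in groups if content in triggers}
    result = {}
    for content, pos in groups:
        if content in triggers:
            result[content] = "trigger"
        elif len(positions) == len(triggers) and all(
            math.dist(pos, positions[t]) <= n for t in triggers
        ):
            # Inside both radii at the same time: assigned to the defined class.
            result[content] = "class"
        else:
            # Everything else is kept as unknown data in the collection class.
            result[content] = "collection"
    return result


groups = [("x", (0, 0)), ("+", (4, 0)), ("#", (2, 1)), ("?", (20, 20))]
assignments = classify_pixel_groups(groups, n=3)
```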
- FIG. 14 shows a schematic block diagram of an embodiment of an exemplary source computer system 100 and an exemplary receiving computer system 200 of a distributed database system 170.
- the embodiment from FIG. 14 largely corresponds to the embodiment from FIG. 8.
- the two computer systems 100, 200 also each have a learning module 120, 220 which, in addition to the tokenizer 122, 222, also has trigger definitions 123, 223 for classifying the data by means of a classifier 124, 224 and a statistical model 125, 225.
- FIG. 15 shows a flow diagram of an embodiment of a further exemplary method for implementing a cross-database index on a distributed database system.
- a pre-trained learning module for machine learning is provided for a source database of the distributed database system, which learning module comprises a plurality of predetermined trigger definitions. These predetermined trigger definitions define triggers for assigning tokens to classes of a group of classes.
- the corresponding source database is provided.
- the source database is managed by a multi-model database management system and comprises a plurality of data records which are stored in a document-oriented data model. These saved records each include one or more fields with field values.
- the source database provided includes a searchable index of all the data included in the stored data records. This index is stored redundancy-free in a further data model managed by the multi-model database management system.
- the index comprises a plurality of tokens generated from the field values of the stored data records, which are linked in the index with one or more pointers to one or more of the data records and / or fields stored in the document-oriented data model from which the corresponding token was generated.
- First tokens in the index, which are included as triggers in one of the trigger definitions, are each assigned to the corresponding trigger definition.
- Second tokens in the index are each assigned to one or more classes of the group of classes.
- the remaining tokens in the index are finally assigned to a collection class to identify the corresponding remaining tokens as unknown data.
- the assignment to the collection class excludes an assignment to one of the trigger definitions as well as an assignment to one of the classes of the first group of classes.
- an additional data set is received and in block 606 saved by the multi-model database management system in the document-oriented data model of the source database.
- the storage takes place in a document or data container.
- the additional data record is processed further using a data processing function. This includes:
- one or more additional tokens are generated from additional field values that the additional data set comprises.
- one or more first additional tokens are identified as triggers if they are included as triggers by one of the trigger definitions.
- the remaining additional tokens are classified.
- the triggers identified in block 610 are used to assign one or more second additional tokens to one or more classes of the group of classes if the corresponding second additional tokens from the additional data record are included in combination with one or more of the identified triggers in accordance with one of the trigger definitions and the corresponding triggers trigger a corresponding class assignment in accordance with the corresponding trigger definition.
- the remaining additional tokens, for which no assignment to one of the trigger definitions and no class assignment based on one of the trigger definitions has taken place, are assigned to the collection class in the course of the classification in block 612.
- the index is supplemented by the multi-model database management system using the additional tokens from block 608, the class assignments of the additional tokens from block 612 and a pointer to the additional data record saved in the document-oriented data model. If pointers indicate individual fields of the additional data record, a plurality of pointers is used for a plurality of fields.
- the addition in block 614 can include a comparison of the additional tokens with the index. If one of the additional tokens is not included in the index, the corresponding additional token is added to the index together with its class assignments and linked to the pointer to the additional data record stored in the document-oriented data model. If one of the class assignments of an additional token included in the index is not included in the index, the corresponding class assignment is added to the corresponding additional token in the index and the corresponding additional token is linked in the index with the pointer to the additional data record stored in the document-oriented data model. If one of the additional tokens with all of its class assignments is included in the index, the corresponding additional token in the index is linked to the pointer to the additional data record stored in the document-oriented data model.
- the addition in block 614 can include identifying combinations of second additional tokens with one or more of the identified triggers, which have triggered a class assignment according to one of the trigger definitions, in the index as classified combinations.
- Class assignments are only carried out for combinations of second additional tokens and one or more identified triggers which are not identified as classified combinations.
- token combinations are compared with the index. If the index already includes the corresponding token combination and this is marked as classified, there is no new classification for this token combination. Only the corresponding token combination and / or the partial combinations and individual tokens comprised by the corresponding token combination are linked in the index with the pointer to the additional data record stored in the document-oriented data model.
- a supplementary data record resulting from the application of the data processing function is created.
- This supplementary data record comprises the additions made to the index of the source database and is used to supplement at least one second index of at least one second individual database of the distributed database system as a receive database.
- the supplementary data set is sent via a network to the second individual database for integration into the second index of the receive database.
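The three cases of the index addition in block 614 (new token, known token with a new class assignment, known token with known class assignments) can be sketched as below. The entry layout, the function name `supplement_index` and the class name "birthplace" are hypothetical; the returned dictionary stands in for one entry of the supplementary data record.

```python
def supplement_index(index, token, classes, pointer):
    """Add one classified token to the index and report what was added.

    index: dict mapping token -> {"classes": set, "pointers": set}
    """
    entry = index.get(token)
    if entry is None:
        # Case 1: token unknown, add it with its class assignments and pointer.
        index[token] = {"classes": set(classes), "pointers": {pointer}}
        return {"token": token, "classes": sorted(classes),
                "pointer": pointer, "case": "new token"}
    missing = set(classes) - entry["classes"]
    entry["classes"] |= missing
    entry["pointers"].add(pointer)
    # Case 2: token known, but at least one class assignment is new.
    # Case 3: token and all class assignments known, only the pointer is linked.
    case = "new class assignment" if missing else "pointer only"
    return {"token": token, "classes": sorted(missing),
            "pointer": pointer, "case": case}


index = {}
a = supplement_index(index, "Sample_city_1", {"city"}, "DS1")
b = supplement_index(index, "Sample_city_1", {"city", "birthplace"}, "DS2")
c = supplement_index(index, "Sample_city_1", {"city"}, "DS3")
```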
- FIG. 16 shows a flow diagram of an embodiment of an exemplary method for generating combined trigger definitions.
- one or more trigger combinations are identified by the learning module, which are each comprised by at least one of the data records and which meet a combination criterion.
- the trigger definitions of the triggers of the corresponding trigger combinations are combined into one or more additional combined trigger definitions.
- the plurality of predetermined trigger definitions of the learning module is supplemented by the one or more additional combined trigger definitions.
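One way to identify trigger combinations is to count how often triggers occur together in the data records. The sketch below assumes, as a stand-in for the unspecified combination criterion, that a pair of triggers qualifies if it co-occurs in at least `min_count` records; the function name and the sample records are invented.

```python
from collections import Counter
from itertools import combinations


def find_trigger_combinations(records, triggers, min_count=2):
    """Return trigger pairs that co-occur in at least min_count records."""
    counts = Counter()
    for tokens in records:
        present = sorted(set(tokens) & set(triggers))
        counts.update(combinations(present, 2))  # each pair counted once per record
    return [pair for pair, count in sorted(counts.items()) if count >= min_count]


records = [
    ["a", "lives in", "b", "in", "c"],
    ["d", "lives in", "e", "in", "f"],
    ["g", "born", "h", "in", "i"],
]
pairs = find_trigger_combinations(records, {"lives in", "in", "born"})
```

Each qualifying pair could then be turned into a combined trigger definition and appended to the existing definitions.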
- FIG. 17 shows a flow diagram of an embodiment of an exemplary method for supplementing additional trigger definitions.
- the pre-trained learning module is supplemented by one or more additional trigger definitions.
- the additional trigger definitions define additional triggers for a replacement of assignments of tokens in the index to the collection class by assignments to one or more classes of a further group of classes in the course of a reclassification.
- the additional trigger definitions can be received by the learning module, for example.
- the corresponding additional trigger definitions are provided by an administrator.
- the additional trigger definitions to be supplemented are created by the learning module.
- the learning module includes a statistical model which is used for a statistical analysis of the tokens included in the collection class and their occurrence in the data records. The result of the statistical analysis is used to create the additional trigger definitions to be added.
- one or more tokens assigned to the collection class are reclassified in the index, which tokens the additional trigger definitions define as additional triggers.
- the reclassification by the learning module includes a replacement of the assignment to the collection class by an assignment to the corresponding additional trigger definition, which includes the corresponding token as an additional trigger.
- the additional triggers are used by the learning module to reclassify one or more tokens assigned to the collection class in the index to one or more classes of the further group of classes if the corresponding tokens assigned to the collection class are included in one of the data records in a combination with one or more of the additional triggers and the corresponding additional triggers trigger a corresponding assignment to the one or more classes of the further group of classes in accordance with the corresponding additional trigger definition.
- the method for adding additional trigger definitions can be carried out repeatedly following a recursive scheme.
- the trigger definitions to be supplemented for each recursion stage each include supplements to a trigger definition of a preceding recursion stage, so that the recursive supplements form tree structures which each include one of the predetermined trigger definitions as the root node.
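A single stage of the reclassification described for FIG. 17 might look like the following sketch. It assumes a simplified index that maps each token to one assignment string and an additional trigger definition whose rule is that the token immediately following the trigger receives the new class; the trigger "born in" and the class "birthplace" are invented for the example. The recursive tree of definitions is not modelled here.

```python
def reclassify(index, records, additional_trigger, new_class):
    """Move tokens out of the collection class using one additional trigger.

    index: dict mapping token -> assignment ("collection", "trigger" or a class)
    records: list of tokenized data records
    Returns the tokens whose assignment changed.
    """
    changed = []
    # The additional trigger itself leaves the collection class.
    if index.get(additional_trigger) == "collection":
        index[additional_trigger] = "trigger"
        changed.append(additional_trigger)
    for tokens in records:
        for pos, token in enumerate(tokens[:-1]):
            following = tokens[pos + 1]
            # Only tokens still in the collection class are reclassified.
            if token == additional_trigger and index.get(following) == "collection":
                index[following] = new_class
                changed.append(following)
    return changed


index = {
    "born in": "collection",
    "Sample_city_3": "collection",
    "Sample_surname_1": "surname",
    "misc": "collection",
}
records = [["Sample_surname_1", "born in", "Sample_city_3", "misc"]]
changed = reclassify(index, records, "born in", "birthplace")
```

Tokens that already carry a class assignment are left untouched, and tokens that no additional trigger reaches remain in the collection class for a later stage.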
- FIG. 18 shows a flow diagram of an embodiment of an exemplary method for correcting trigger definitions in blocks.
- a corrected trigger definition for replacing one of the stored trigger definitions of the learning module is received.
- This corrected trigger definition is provided by an administrator, for example.
- the corrected trigger definition is created by the learning module using a statistical model.
- the corresponding stored trigger definition is replaced by the corrected trigger definition.
- the tokens classified using the corresponding stored trigger definition are reclassified, the reclassification taking place using the corrected trigger definition.
List of reference symbols
- DS1, ..., DS3 data records
- K 1 , ..., K 6 nodes
- D 1 , ..., D 6 data
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- Computing Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| DE102019108856.9A DE102019108856A1 (de) | 2019-04-04 | 2019-04-04 | Datenbankübergreifender Index auf einem verteilten Datenbanksystem |
| PCT/EP2020/059041 WO2020201248A1 (de) | 2019-04-04 | 2020-03-31 | Datenbankübergreifender index auf einem verteilten datenbanksystem |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| EP3948576A1 (de) | 2022-02-09 |
Family
ID=70165999
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| EP20716739.6A Pending EP3948576A1 (de) | 2019-04-04 | 2020-03-31 | Datenbankübergreifender index auf einem verteilten datenbanksystem |
Country Status (3)
| Country | Link |
|---|---|
| EP (1) | EP3948576A1 (de) |
| DE (1) | DE102019108856A1 (de) |
| WO (1) | WO2020201248A1 (de) |
Citations (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| DE19627472A1 (de) * | 1996-07-08 | 1998-01-15 | Ser Systeme Ag | Datenbanksystem |
| US20090210429A1 (en) * | 2008-02-19 | 2009-08-20 | Yahoo! Inc. | System and method for asynchronous update of indexes in a distributed database |
| DE102010043265A1 (de) * | 2009-11-06 | 2011-05-12 | Symantec Corporation, Mountain View | Systeme und Verfahren zum Verarbeiten und Verwalten von objektbezogenen Daten zur Verwendung durch mehrere Anwendungen |
| US20130173536A1 (en) * | 2008-07-02 | 2013-07-04 | Commvault Systems, Inc. | Distributed indexing system for data storage |
| US20160055143A1 (en) * | 2014-08-21 | 2016-02-25 | Dropbox, Inc. | Multi-user search system with methodology for bypassing instant indexing |
| US20160259810A1 (en) * | 2015-03-06 | 2016-09-08 | Hewlett-Packard Development Company, L.P. | Global file index |
| DE102016226338A1 (de) * | 2016-12-30 | 2018-07-05 | Bundesdruckerei Gmbh | Bitsequenzbasiertes Datenklassifikationssystem |
2019
- 2019-04-04: DE application DE102019108856.9A (patent DE102019108856A1, de), status: active, pending

2020
- 2020-03-31: WO application PCT/EP2020/059041 (patent WO2020201248A1, de), status: not active, ceased
- 2020-03-31: EP application EP20716739.6A (patent EP3948576A1, de), status: active, pending
Patent Citations (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| DE19627472A1 (de) * | 1996-07-08 | 1998-01-15 | Ser Systeme Ag | Database system |
| US20020133476A1 (en) * | 1996-07-08 | 2002-09-19 | Gert J. Reinhardt | Database system |
| US20090210429A1 (en) * | 2008-02-19 | 2009-08-20 | Yahoo! Inc. | System and method for asynchronous update of indexes in a distributed database |
| US20130173536A1 (en) * | 2008-07-02 | 2013-07-04 | Commvault Systems, Inc. | Distributed indexing system for data storage |
| DE102010043265A1 (de) * | 2009-11-06 | 2011-05-12 | Symantec Corporation, Mountain View | Systems and methods for processing and managing object-related data for use by multiple applications |
| US20160055143A1 (en) * | 2014-08-21 | 2016-02-25 | Dropbox, Inc. | Multi-user search system with methodology for bypassing instant indexing |
| US20160259810A1 (en) * | 2015-03-06 | 2016-09-08 | Hewlett-Packard Development Company, L.P. | Global file index |
| DE102016226338A1 (de) * | 2016-12-30 | 2018-07-05 | Bundesdruckerei Gmbh | Bit-sequence-based data classification system |
Non-Patent Citations (3)
| Title |
|---|
| GREENSTEIN B ET AL: "DIFS: a distributed index for features in sensor networks", NEW FRONTIERS IN TELECOMMUNICATIONS : 2003 IEEE INTERNATIONAL CONFERENCE ON COMMUNICATIONS ; ICC 2003 ; 11 - 15 MAY 2003, ANCHORAGE, ALASKA, USA; [IEEE INTERNATIONAL CONFERENCE ON COMMUNICATIONS], IEEE OPERATIONS CENTER, PISCATAWAY, NJ, 1 January 2003 (2003-01-01), pages 163 - 173, XP010642628, ISBN: 978-0-7803-7802-5 * |
| LOMET D: "Replicated indexes for distributed data", PARALLEL AND DISTRIBUTED INFORMATION SYSTEMS, 1996., FOURTH INTERNATIONAL CONFERENCE ON MIAMI BEACH, FL, USA 18-20 DEC. 1996, LOS ALAMITOS, CA, USA, IEEE COMPUT. SOC, US, 18 December 1996 (1996-12-18), pages 108 - 119, XP010213181, ISBN: 978-0-8186-7475-4, DOI: 10.1109/PDIS.1996.568673 * |
| See also references of WO2020201248A1 * |
Also Published As
| Publication number | Publication date |
|---|---|
| DE102019108856A1 (de) | 2020-10-08 |
| WO2020201248A1 (de) | 2020-10-08 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| DE102013209868B4 (de) | | Querying and integrating structured and unstructured data |
| EP3948577B1 (de) | | Automated machine learning based on stored data |
| DE112011103273B4 (de) | | Method, computer program product and apparatus for propagating identities across application tiers using context-dependent assignment and set values |
| DE112020002892T5 (de) | | Active learning for data matching |
| EP1779263A1 (de) | | Speech and text analysis device and corresponding method |
| DE112021000689T5 (de) | | Attestation of neural processes |
| EP3889806B1 (de) | | Bit-sequence-based data classification system |
| WO2021204943A2 (de) | | Monitoring system with multi-stage request checking |
| EP3552140B1 (de) | | Database index comprising multiple fields |
| EP3552141B1 (de) | | Server computer system for providing data records |
| DE102024204719A1 (de) | | System for preventing misuse of large foundation models and a method therefor |
| EP4002145B1 (de) | | List-based data storage for data search |
| EP4123517A1 (de) | | Integration of distributed machine learning models |
| EP3948576A1 (de) | | Cross-database index on a distributed database system |
| DE112021006377T5 (de) | | Scheduling query execution plans in a relational database |
| EP4736401A1 (de) | | Computer-implemented method for providing an output sequence, in particular an at least partially text-based one |
| DE102019108858A1 (de) | | Machine learning based on trigger definitions |
| WO2021204945A1 (de) | | Microcontroller- or microprocessor-based system with authorization checking for requests |
| EP4462315A1 (de) | | Devices and methods for federated computing |
| Hui et al. | | Analysis of decision tree classification algorithm based on attribute reduction and application in criminal behavior |
| EP4116858A1 (de) | | Machine learning based on database operations |
| CN120610875A (zh) | | Security analysis method, apparatus and computer device based on a cloud server database |
| DE102024201985A1 (de) | | Method for verifying the information security (security) and/or operational safety (safety) of a technical system |
| DE102022125399A1 (de) | | Detecting an attack on a computer system to be protected |
| DE202025002939U1 (de) | | Strengthening human innovation and excellence |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | STAA | Information on the status of an EP patent application or granted EP patent | Free format text: STATUS: UNKNOWN |
| | STAA | Information on the status of an EP patent application or granted EP patent | Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE |
| | PUAI | Public reference made under Article 153(3) EPC to a published international application that has entered the European phase | Free format text: ORIGINAL CODE: 0009012 |
| | STAA | Information on the status of an EP patent application or granted EP patent | Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE |
| | 17P | Request for examination filed | Effective date: 20211104 |
| | AK | Designated contracting states | Kind code of ref document: A1; Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR |
| | DAV | Request for validation of the European patent (deleted) | |
| | DAX | Request for extension of the European patent (deleted) | |
| | STAA | Information on the status of an EP patent application or granted EP patent | Free format text: STATUS: EXAMINATION IS IN PROGRESS |
| | 17Q | First examination report despatched | Effective date: 20230418 |
| | P01 | Opt-out of the competence of the Unified Patent Court (UPC) registered | Effective date: 20230526 |
| | RAP3 | Party data changed (applicant data changed or rights of an application transferred) | Owner name: BUNDESDRUCKEREI GMBH |