WO2019128525A1 - Method and apparatus for determining data anomalies - Google Patents
Method and apparatus for determining data anomalies
- Publication number
- WO2019128525A1 (PCT/CN2018/116085)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- distribution
- historical
- data
- determining
- probability
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/14—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
- H04L63/1408—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
- H04L63/1425—Traffic logging, e.g. anomaly detection
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/14—Network analysis or design
- H04L41/142—Network analysis or design using statistical or mathematical methods
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
- G06F16/90335—Query processing
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/64—Protecting data integrity, e.g. using checksums, certificates or signatures
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N7/00—Computing arrangements based on specific mathematical models
- G06N7/01—Probabilistic graphical models, e.g. probabilistic networks
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L43/00—Arrangements for monitoring or testing data switching networks
- H04L43/16—Threshold monitoring
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/14—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
- H04L63/1408—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
- H04L63/1416—Event detection, e.g. attack signature detection
Definitions
- One or more embodiments of the present specification relate to the field of computer technology, and more particularly to methods and apparatus for determining data anomalies.
- Data processing often passes through multiple links in a business chain.
- Data may become abnormal within a given business link or during transmission between business links, for example because the system is attacked or a model behaves abnormally.
- One or more embodiments of the present specification describe a method and apparatus that can detect data anomalies and raise alerts independently of the business meaning of the data.
- a method for determining a data anomaly includes: obtaining a plurality of data packets within a predetermined time period, the plurality of data packets having the same data structure; acquiring a historical distribution of historical data having the same data structure; comparing the plurality of data packets with the historical distribution; and determining whether there is a data abnormality according to the comparison result.
- comparing the plurality of data packets with the historical distribution comprises: substituting the plurality of data packets into the historical distribution to obtain the distribution state parameters of the plurality of data packets in the historical distribution; and comparing the plurality of distribution state parameters with a predetermined threshold associated with the distribution state to determine the number of data packets that exceed the threshold.
- determining, according to the comparison result, whether there is a data abnormality comprises: determining whether there is a data abnormality according to the number of the data packets exceeding the threshold.
- comparing the plurality of data packets with the historical distribution comprises: determining a data distribution state of the plurality of data packets as a current distribution; and comparing the current distribution with the historical distribution.
- comparing the current distribution with the historical distribution comprises: determining a distribution center of the current distribution; acquiring a distribution center of the historical distribution; and determining an offset between the distribution center of the current distribution and the distribution center of the historical distribution.
- determining, according to the comparison result, whether there is a data abnormality comprises: determining that there is a data abnormality in response to the offset exceeding a predetermined offset threshold.
- comparing the current distribution with the historical distribution comprises: determining a distribution state parameter of a randomly extracted data packet in the current distribution, that is, a first parameter; determining a distribution state parameter of the randomly extracted data packet in the historical distribution, that is, a second parameter; and determining a difference between the first parameter and the second parameter.
- determining, according to the comparison result, whether there is a data abnormality comprises: determining that there is a data abnormality in response to the difference exceeding a predetermined difference threshold.
- the historical distribution is a historical probability distribution obtained by processing the historical data using a Gaussian mixture model; correspondingly, the current distribution is a current probability distribution obtained by processing the plurality of current data packets using a Gaussian mixture model; the distribution state parameter is embodied as a probability value; and the distribution center is embodied as the peak position of the probability distribution curve.
- the historical distribution is a historical clustering distribution obtained by applying a clustering algorithm to the historical data; correspondingly, the current distribution is a current clustering distribution obtained by processing the plurality of current data packets with the same clustering algorithm; the distribution state parameter is embodied as the distance between the position of the data packet in the cluster distribution space and the corresponding cluster center; and the distribution center is embodied as the center position of the cluster distribution.
- an apparatus for determining a data anomaly includes: a data packet acquisition unit configured to acquire a plurality of data packets within a predetermined time period, the plurality of data packets having the same data structure; a history acquisition unit configured to acquire a historical distribution of historical data having the same data structure; a comparison unit configured to compare the plurality of data packets with the historical distribution; and a determining unit configured to determine, according to the comparison result, whether there is a data abnormality.
- a computer readable storage medium having stored thereon a computer program for causing a computer to perform the method of the first aspect when the computer program is executed in a computer.
- a computing device comprising a memory and a processor, the memory storing executable code, and when the processor executes the executable code, implementing the method of the first aspect above.
- the currently obtained data packet is compared with the historical distribution obtained based on the historical data statistics, and whether the current data packet has a data abnormality is determined according to the comparison result. In this way, it is possible to effectively judge and warn of data anomalies without relying on the business meaning of the data.
- Figure 1 shows a schematic diagram of one embodiment of the present disclosure
- Figure 2 shows a flow chart of a method in accordance with one embodiment
- Figure 3 illustrates a flow chart of a comparison and determination process in accordance with one embodiment
- FIG. 4 shows a flow chart of a comparison and determination process in accordance with another embodiment
- Figure 5 shows a schematic diagram of comparing peak positions
- Figure 6 illustrates a flow chart of a comparison and determination process in accordance with yet another embodiment
- Figure 7 shows a schematic diagram of comparing probability values
- Figure 8 shows a schematic block diagram of a determining device in accordance with one embodiment.
- FIG. 1 is a schematic illustration of one embodiment of the disclosure.
- a computing platform such as an Alipay server, acquires a plurality of data packets within a predetermined time period, such as a credit request data packet for a user requesting a loan. These packets have the same data structure, for example with the same fields, and the field contents can include the user's age, gender, income, loan amount, and so on.
- the computing platform also obtains a historical distribution of historical data having the same data structure as above, which may be composed of a large number of similar data packets received over a long period of time.
- the computing platform compares the plurality of data packets with the historical distribution, and determines whether there is a data abnormality according to the comparison result.
- if no data anomaly is determined, the computing platform can continue to process the data or send it to the next business link. If it is determined that there is a data anomaly, an alert can be raised to notify the relevant personnel to analyze the cause of the anomaly and trigger the relevant solution.
- the specific implementation process for determining data anomalies is described below.
- FIG. 2 illustrates a method flow diagram in accordance with one embodiment.
- the method can be executed by any computing platform with computing and processing capability, such as a server.
- the method includes the following steps:
- Step 21: acquiring a plurality of data packets within a predetermined time period, the plurality of data packets having the same data structure;
- Step 22: acquiring a historical distribution of historical data having the same data structure;
- Step 23: comparing the plurality of data packets with the historical distribution;
- Step 24: determining whether there is a data abnormality according to the comparison result.
- in step 21, a plurality of data packets within a predetermined time period are acquired.
- the plurality of data packets are data packets received from an external organization, such as users' credit request data packets received by an Alipay server from a bank or financial institution. In this case, the subsequent steps can determine whether there is a data abnormality in the data packets transmitted by the external organization. Such data anomalies may be caused by attacks or tampering during data transfer, or by shifts in the user population.
- the plurality of data packets are data packets generated by a certain service link in a service chain of data processing. For example, for a credit request packet received from an external organization, it is required to perform processing on multiple business links, including field parsing, dimensionality reduction, model calculation, and the like.
- the subsequent analysis of the data packets generated in any business link can be performed. Through subsequent analysis, it can be determined whether there is data anomaly in the data packet of the service link.
- the data anomaly in a certain business link may be caused by a problem with data transmission between business links or a problem with the system or model of the business link. Therefore, abnormal judgment of the data packets of the intermediate business link can also help determine whether the system or the model has an abnormality.
- each packet can have the same fields.
- the data packet is a credit request packet, and the fields of each packet may include user age, gender, income, loan amount, and the like.
- the data packet is a user operation record for analyzing a user's behavior pattern. At this time, the fields of each data packet may include a user ID, an operation behavior, an operation object, an operation time, and the like.
- the plurality of data packets acquired in step 21 are data packets within a predetermined time period from the current time.
- the predetermined period of time described above can be set to a shorter period of time such that the plurality of packets obtained are the most recently generated packets.
- where the amount of data is large, the predetermined time period may be preset to 0.5 s or 1 s; where the amount of data is relatively small, the predetermined time period may be preset to 30 s or 1 min.
- the number of acquired packets can also be selected according to business needs. In one embodiment, all of the data packets within the predetermined time period are obtained. In another embodiment, a predetermined number of data packets within the predetermined time period are acquired.
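- The acquisition in step 21 can be sketched as a sliding time window. This is only an illustration under stated assumptions: each packet is assumed to arrive with a timestamp in seconds, and the names `PacketWindow`, `period_s`, and `max_packets` are hypothetical, not taken from the specification.

```python
from collections import deque


class PacketWindow:
    """Keeps the packets whose timestamps fall within the last `period_s` seconds."""

    def __init__(self, period_s, max_packets=None):
        self.period_s = period_s
        # None means "take every packet in the window"; otherwise a positive int.
        self.max_packets = max_packets
        self._buf = deque()  # (timestamp, packet) pairs in arrival order

    def add(self, timestamp, packet):
        self._buf.append((timestamp, packet))

    def current(self, now):
        # Discard packets that are older than the predetermined time period.
        while self._buf and self._buf[0][0] < now - self.period_s:
            self._buf.popleft()
        packets = [packet for _, packet in self._buf]
        if self.max_packets is not None:
            packets = packets[-self.max_packets:]  # keep only the most recent ones
        return packets


window = PacketWindow(period_s=1.0)
window.add(0.0, {"age": 30, "income": 5000})
window.add(0.6, {"age": 41, "income": 7200})
window.add(1.2, {"age": 25, "income": 3100})
recent = window.current(now=1.5)  # the packet stamped 0.0 has expired
```

- Setting `max_packets` corresponds to the embodiment that acquires a predetermined number of packets; leaving it as `None` corresponds to acquiring all packets in the period.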
- in step 22, the historical distribution of the historical data is acquired.
- the above historical data has the same data structure as the data packet obtained in step 21.
- the historical data includes a large number of similar data packets received over a long period of time, such as a week or a month.
- the historical data described above has the same data source as the data packet obtained in step 21.
- for example, if the data packets in step 21 are credit request data packets from an external organization, the historical data are also credit request data packets received from the external organization over a preceding long period of time.
- if the data packets in step 21 are data packets generated by an intermediate business link, the historical data are data packets generated by the same business link over a preceding long period of time.
- the historical distribution of historical data can be obtained by performing statistics and operations on historical data.
- in step 23, the currently obtained plurality of data packets are compared with the historical distribution; in step 24, based on the comparison result, it is determined whether the current data packets have a data abnormality.
- different comparison methods may be used in step 23, and step 24 makes its determination based on the comparison result that each method produces.
- FIG. 3 illustrates a flow chart of a comparison and determination process in accordance with one embodiment. It can be understood that the flow shown in FIG. 3 is a sub-step of steps 23 and 24 in FIG. 2.
- the process includes the following steps.
- in step 31, the plurality of data packets are substituted into the historical distribution to obtain the distribution state parameter of each data packet in the historical distribution, yielding a plurality of distribution state parameters.
- in step 32, the plurality of distribution state parameters are compared with a predetermined threshold associated with the distribution state to determine the number of data packets exceeding the threshold; then, in step 33, it is determined whether there is a data anomaly based on the number of data packets exceeding the threshold.
- the above process substitutes the plurality of currently acquired data packets into the historical distribution and determines the distribution state of those data packets in the historical distribution. If a large number of abnormal states (beyond the threshold) occur, a data abnormality is considered to currently exist.
- the historical distribution is a data distribution state obtained by performing statistics and operations on historical data.
- Different algorithms can be used to obtain historical distribution based on historical data.
- the following uses a Gaussian mixture model and a clustering algorithm as examples to describe how the process of FIG. 3 is executed for different forms of historical distribution.
- the historical distribution is a historical probability distribution obtained by processing the historical data using a Gaussian Mixture Model (GMM).
- a Gaussian mixture model quantifies data using Gaussian probability density functions, decomposing the data into several components, each based on a Gaussian probability density function.
- the Gaussian probability density function usually appears in the form of a normal distribution curve.
- the currently obtained data packets and the historical data have the same data structure, and the data structure contains several fields.
- a mixed Gaussian model can be used to decompose the data fields as dimensions to determine a mixed Gaussian probability density function as a historical probability distribution.
- the flow of FIG. 3 can be specifically performed as follows.
- in step 31, the plurality of currently acquired data packets are substituted into the historical probability distribution, and the probability value of each data packet in the historical probability distribution is obtained, yielding a plurality of probability values. That is to say, where the historical distribution is a Gaussian probability distribution, the distribution state parameter is embodied as a distribution probability value.
- the probability values of the N data packets in the historical probability distribution are p_1, p_2, ..., p_N, respectively.
- in step 32, the determined plurality of probability values are compared with a predetermined probability threshold, and the number of data packets whose probability values are less than the probability threshold, that is, the first number M1, is determined.
- the predetermined probability threshold is a small probability value p_0.
- the specific size of the probability threshold p_0 can be set according to business needs.
- in step 33, it is determined whether there is a data anomaly based on the first number M1. More specifically, in one example, if the first number M1 exceeds a predetermined number threshold M0, it is determined that there is a data anomaly. That is to say, if more than M0 of the currently obtained N data packets are "small probability" data packets, an abnormality is considered to currently exist. In another example, if the ratio M1/N of the first number M1 to the number N of the plurality of data packets exceeds a predetermined ratio threshold, it is determined that there is a data anomaly. That is to say, once more than a predetermined proportion of the N data packets are "small probability" data packets, a data abnormality is considered to currently exist.
- in this way, the number or proportion of "small probability" data packets among the currently acquired data packets is determined; if the number or proportion exceeds a certain threshold, it can be determined that there is currently a data abnormality.
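- The Gaussian-mixture variant of steps 31 to 33 can be sketched as follows. This is a minimal illustration, not the specification's implementation: the mixture is assumed to be already fitted and to have diagonal covariance, the value compared against p_0 is a probability density rather than a true probability, and the function names and parameters (`gmm_density`, `count_low_probability`, `is_anomalous`, `m0`, `ratio0`) are hypothetical.

```python
import math


def gmm_density(x, components):
    """Density of point x under a mixture of diagonal-covariance Gaussians.

    `components` is a list of (weight, means, variances) tuples, one per
    mixture component; `x`, `means`, and `variances` have one entry per field.
    """
    total = 0.0
    for weight, means, variances in components:
        log_pdf = 0.0
        for xi, mu, var in zip(x, means, variances):
            log_pdf += -0.5 * math.log(2 * math.pi * var) - (xi - mu) ** 2 / (2 * var)
        total += weight * math.exp(log_pdf)
    return total


def count_low_probability(packets, components, p0):
    """Step 32: the first number M1, i.e. packets whose density is below p0."""
    return sum(1 for x in packets if gmm_density(x, components) < p0)


def is_anomalous(packets, components, p0, m0=None, ratio0=None):
    """Step 33: compare M1 with a count threshold and/or a ratio threshold."""
    m1 = count_low_probability(packets, components, p0)
    if m0 is not None and m1 > m0:
        return True
    if ratio0 is not None and packets and m1 / len(packets) > ratio0:
        return True
    return False


# Historical distribution: a single standard normal component over one field.
historical = [(1.0, [0.0], [1.0])]
current_packets = [[0.0], [0.5], [5.0], [6.0]]
m1 = count_low_probability(current_packets, historical, p0=0.01)
```

- In this toy input the two packets at 5.0 and 6.0 fall below the threshold, so the count-based check with `m0=1` reports an anomaly while a ratio-based check with `ratio0=0.6` does not.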
- the clustering algorithm is an algorithm for performing similarity analysis and data statistics on multiple samples.
- each data packet can be mapped to a point in multidimensional space by using the fields in the data packet as a dimension, and then multiple data packets are divided and clustered by using a clustering algorithm.
- a hierarchical clustering algorithm, such as the BIRCH or CURE algorithm, a density-based clustering algorithm, such as the DBSCAN or OPTICS algorithm, or another suitable clustering algorithm may be employed.
- the flow of FIG. 3 can be specifically performed as follows.
- in step 31, the location of each of the plurality of currently acquired data packets in the distribution space of the historical cluster distribution is determined, yielding a plurality of locations, and the distance between each of the plurality of locations and the corresponding cluster center of the historical cluster distribution is determined, yielding a plurality of distances.
- each data packet can be mapped to a point in a multidimensional space composed of fields, which is also the distribution space of the historical cluster distribution.
- the positions P_1, P_2, ..., P_N of the currently acquired N data packets in the space of the historical cluster distribution may be determined by the above mapping.
- the positions of the cluster centers in the historical cluster distribution may also be determined, and then the distances D_1, D_2, ..., D_N between the respective positions P_1, P_2, ..., P_N and the corresponding cluster centers of the historical cluster distribution may be determined. It can be understood that each distance is a distance between multi-dimensional vectors in the multi-dimensional space.
- in step 32, the plurality of distances are compared with a preset distance threshold to determine the number of distances that exceed the preset distance threshold, that is, the second number M2.
- the cluster center is located in the area where the data distribution is densest, and the farther a data packet is from the cluster center, the lower its frequency or probability of occurrence. Therefore, the above predetermined distance threshold can be set to a relatively large distance D_0.
- in step 33, it is determined whether there is a data anomaly based on the second number M2. Specifically, in one example, if the second number M2 exceeds a predetermined number threshold M0, a data anomaly is considered to be present. That is to say, once more than M0 of the currently acquired N data packets are far from the cluster center, a data abnormality is considered to currently exist. In another example, if the ratio M2/N of the second number M2 to the number N of the plurality of data packets exceeds a predetermined ratio threshold, it is determined that there is a data anomaly. That is to say, once more than a predetermined proportion of the currently acquired N data packets are far from the cluster center, a data abnormality is considered to currently exist.
- in this way, the number or proportion of data packets far from the cluster center among the currently acquired data packets is determined; if the number or proportion exceeds a certain threshold, it can be determined that there is currently a data abnormality.
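- The clustering variant of steps 31 to 33 can be sketched in the same way. The sketch assumes that the historical cluster centers are already known, that the "corresponding" cluster center of a packet is its nearest center, and that distance is Euclidean; these choices and the function names are illustrative assumptions rather than details given in the specification.

```python
import math


def nearest_center_distance(point, centers):
    """Step 31: Euclidean distance from a packet's position to its nearest center."""
    return min(math.dist(point, center) for center in centers)


def count_far_packets(points, centers, d0):
    """Step 32: the second number M2, i.e. packets farther than d0 from their center."""
    return sum(1 for p in points if nearest_center_distance(p, centers) > d0)


historical_centers = [(0.0, 0.0), (10.0, 10.0)]
current_points = [(0.0, 1.0), (9.0, 10.0), (5.0, 5.0), (20.0, 20.0)]

m2 = count_far_packets(current_points, historical_centers, d0=3.0)
ratio = m2 / len(current_points)  # step 33 compares M2 or M2/N with a threshold
```

- Here the points (5, 5) and (20, 20) lie more than 3.0 from either center, so M2 is 2 and M2/N is 0.5.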
- the above example combining the mixed Gaussian distribution and the clustering distribution describes an implementation in which the current data packet is substituted into the historical distribution for comparison.
- the distribution state of the plurality of currently acquired data packets may instead be determined as the current distribution, and the current distribution compared with the historical distribution to determine whether there is a data anomaly.
- Figure 4 illustrates a flow chart of a comparison and determination process in accordance with another embodiment. It will be understood that the flow shown in Figure 4 is a sub-step of steps 23 and 24 in Figure 2.
- in step 41, the data distribution state of the plurality of currently acquired data packets is determined as the current distribution; in step 42, the distribution center of the current distribution is determined; in step 43, the distribution center of the historical distribution is acquired; in step 44, an offset between the distribution center of the current distribution and the distribution center of the historical distribution is determined.
- in step 45, it is determined whether the offset exceeds a predetermined offset threshold. If the predetermined offset threshold is exceeded, in step 46, it is determined that there is a data anomaly.
- the above process compares the distribution center of the current distribution with the distribution center of the historical distribution. When the offset between the two exceeds a predetermined threshold, a data abnormality is considered to currently exist.
- the following is still combined with an example of a Gaussian mixture model and a clustering algorithm to describe the specific execution process of the flow of FIG. 4 in the case of different forms of historical distribution.
- the historical distribution is a historical probability distribution obtained by processing historical data using a mixed Gaussian model.
- in step 41, the same Gaussian mixture model is used to process the plurality of currently acquired data packets, and the Gaussian mixture distribution of the current data packets is obtained as the current probability distribution.
- in step 42, a first peak position P1 is determined, the first peak position being the peak position of the curve corresponding to the current probability distribution.
- in step 43, a second peak position P2 is obtained, which is the peak position of the curve corresponding to the historical probability distribution.
- the distribution center is embodied as the peak position of the probability distribution curve.
- in step 44, a positional offset ΔP between the first peak position P1 and the second peak position P2 is determined. In step 45, it is judged whether the positional offset ΔP exceeds a predetermined offset threshold, and if it does, it is determined that there is currently a data abnormality.
- FIG. 5 shows a schematic of the above comparison of peak positions.
- V1 is a curve schematically showing the current probability distribution, and V2 is a curve schematically showing the historical probability distribution.
- the peak position of the curve V1 is shown schematically as P1, and the peak position of the curve V2 is shown schematically as P2. If the offset ΔP between the peak positions P1 and P2 is greater than a preset threshold, the current probability distribution can be considered to have changed greatly compared with the historical probability distribution, so a data anomaly may currently exist.
- the Gaussian mixture model may perform data statistics over more dimensions, in which case the actual probability distribution is a multi-dimensional surface rather than a curve.
- Figure 5 is merely a simplified illustration.
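- For the one-dimensional picture of FIG. 5, the peak comparison of steps 42 to 44 can be sketched as follows. This is a simplified illustration only: both mixtures are assumed to be already fitted over a single field, the peak is located by a plain grid search over an assumed range [`lo`, `hi`], and all function names and parameter values are hypothetical.

```python
import math


def mixture_pdf(x, components):
    """Density at x of a 1-D Gaussian mixture given as (weight, mean, variance) tuples."""
    return sum(
        w * math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
        for w, mu, var in components
    )


def peak_position(components, lo, hi, steps=10_000):
    """Locate the highest point of the density curve by grid search over [lo, hi]."""
    best_x, best_y = lo, mixture_pdf(lo, components)
    for i in range(1, steps + 1):
        x = lo + (hi - lo) * i / steps
        y = mixture_pdf(x, components)
        if y > best_y:
            best_x, best_y = x, y
    return best_x


def peak_offset(current, historical, lo, hi):
    """Step 44: offset between the current peak P1 and the historical peak P2."""
    return abs(peak_position(current, lo, hi) - peak_position(historical, lo, hi))


historical_mix = [(1.0, 0.0, 1.0)]  # curve V2, peak near 0
current_mix = [(1.0, 2.0, 1.0)]     # curve V1, peak near 2
delta_p = peak_offset(current_mix, historical_mix, lo=-10.0, hi=10.0)
anomaly = delta_p > 1.5             # step 45 with an assumed offset threshold
```

- A grid search is used only because it is easy to verify; in more dimensions a numerical optimizer or the component means of the fitted model would be a more practical way to locate the peak.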
- in step 41, the clustering distribution of the plurality of currently acquired data packets is determined as the current clustering distribution by using the same clustering algorithm.
- in step 42, a first center position is determined, the first center position being the center position of the cluster corresponding to the current cluster distribution; and in step 43, a second center position is obtained, the second center position being the center position of the cluster corresponding to the historical clustering distribution.
- the distribution center is embodied as the center position of the cluster distribution. It can be understood that although different clustering algorithms may group the data into a plurality of different clusters, the center of the entire cluster distribution can always be determined by re-clustering or by computing the center of gravity. The first center position and the second center position described above are determined in this way.
- in step 44, the distance D between the first center position and the second center position is determined. It can be understood that the distance can be the distance between multi-dimensional points or multi-dimensional vectors in a multi-dimensional space.
- in step 45, it is judged whether the above distance D exceeds a predetermined distance threshold. If D exceeds the distance threshold, it is determined that there is a data anomaly.
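- The center comparison for the clustering case can be sketched with the center-of-gravity option mentioned above. The sketch assumes Euclidean distance and takes the overall center as the mean of all mapped points; the function names and the threshold value are hypothetical.

```python
import math


def center_of_gravity(points):
    """Mean position of a non-empty list of equal-length points."""
    n = len(points)
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / n for d in range(dims))


def center_offset(current_points, historical_points):
    """Step 44: distance D between the current and historical distribution centers."""
    return math.dist(
        center_of_gravity(current_points), center_of_gravity(historical_points)
    )


historical_pts = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]  # center (1, 1)
current_pts = [(4.0, 5.0), (6.0, 5.0), (4.0, 7.0), (6.0, 7.0)]     # center (5, 6)

distance_d = center_offset(current_pts, historical_pts)
anomaly_by_center = distance_d > 3.0  # step 45 with an assumed distance threshold
```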
- Figure 6 illustrates a flow chart of a comparison and determination process in accordance with yet another embodiment. It will be understood that the flow shown in Figure 6 is a sub-step of steps 23 and 24 in Figure 2.
- as shown in FIG. 6, in step 61, the data distribution state of the currently acquired plurality of data packets is determined as the current distribution; in step 62, the distribution state parameter of a randomly extracted data packet in the current distribution is determined, that is, the first parameter; in step 63, the distribution state parameter of the randomly extracted data packet in the historical distribution is determined, that is, the second parameter; in step 64, the difference between the first parameter and the second parameter is determined; in step 65, it is determined whether the difference exceeds a predetermined difference threshold. If the predetermined difference threshold is exceeded, in step 66, it is determined that there is a data anomaly.
- the above process randomly extracts a sample, and compares the respective state parameters of the sample in the current distribution and the historical distribution. When the difference between the two exceeds a predetermined threshold, it is considered that there is currently a data abnormality.
- the following is still combined with an example of a Gaussian mixture model and a clustering algorithm to describe the specific execution process of the flow of FIG. 6 in the case of different forms of historical distribution.
- the historical distribution is a historical probability distribution obtained by processing historical data using a mixed Gaussian model.
- the flow of FIG. 6 can be performed as follows.
- in step 61, the Gaussian mixture model is used to determine the Gaussian mixture probability distribution of the plurality of currently acquired data packets as the current probability distribution.
- in step 62, the probability of occurrence of the randomly extracted data packet in the current probability distribution, that is, the first probability p1, is determined.
- in step 63, the probability of occurrence of the randomly extracted data packet in the historical probability distribution, that is, the second probability p2, is determined.
- in step 64, a probability difference Δp between the first probability p1 and the second probability p2 is determined.
- Figure 7 shows a schematic of the above comparison of probability values.
- V1 is a curve schematically showing the current probability distribution, and V2 is a curve schematically showing the historical probability distribution. It can be seen that the probability corresponding to the randomly extracted data packet R is p1 on the current probability distribution curve V1 and p2 on the historical probability distribution curve V2. If the difference Δp between the probabilities p1 and p2 is greater than a preset threshold, the current probability distribution can be considered to have changed greatly compared with the historical probability distribution, so an abnormality may currently exist.
- the Gaussian mixture model is generally based on multi-dimensional data statistics, in which case the actual distribution is a multi-dimensional surface rather than a curve.
- Figure 7 is merely a simplified illustration.
- In step 61, the same clustering algorithm is used to determine the cluster distribution of the plurality of currently acquired data packets as the current cluster distribution.
- In step 62, the distance between the position of the randomly drawn data packet in the current cluster distribution and the center of the cluster corresponding to the current cluster distribution, that is, the first distance D1, is determined.
- In step 63, the distance between the position of the randomly drawn data packet in the historical cluster distribution and the center of the cluster corresponding to the historical cluster distribution, that is, the second distance D2, is determined. The randomly drawn data packet can be mapped to a point in the cluster distribution space, and D1 and D2 are the distances between that point and the center of the corresponding cluster distribution.
- These distances can be obtained with any method for computing the distance between multidimensional vectors in a multidimensional space.
- A distance difference ΔD between the first distance D1 and the second distance D2 is determined.
- The historical distribution is not limited to these two examples; other algorithms can be used as long as they can derive the statistical distribution of the historical data.
- Although the comparison of the currently acquired data packets with the historical distribution has been described above in connection with the specific flows of FIGS. 3, 4, and 6, the manner of comparison is not limited to these specific examples. Depending on the specific form of the historical distribution, other distribution state parameters can also be compared.
- The currently acquired data packets can be compared with the historical distribution, which is derived from statistics on historical data, to determine whether a data anomaly currently exists. Data anomalies are thereby discovered more effectively, enabling early warning and intervention.
- FIG. 8 shows a schematic block diagram of a determining device in accordance with one embodiment.
- The determining apparatus 800 includes: a data packet acquiring unit 81 configured to acquire a plurality of data packets within a predetermined time period, the plurality of data packets having the same data structure; a history acquiring unit 82 configured to acquire the historical distribution of historical data having the same data structure; a comparison unit 83 configured to compare the plurality of data packets with the historical distribution; and a determining unit 84 configured to determine, according to the comparison result, whether there is a data anomaly.
- The comparison unit 83 includes (shown by the dotted line on the left): a substitution module 831 configured to substitute the plurality of data packets into the historical distribution to obtain the distribution state parameter of each data packet in the historical distribution, yielding a plurality of distribution state parameters; and a threshold comparison module 832 configured to compare the plurality of distribution state parameters with a predetermined threshold associated with the distribution state and to determine the number of data packets whose distribution state parameter exceeds the threshold.
- The determining unit 84 is configured to determine whether there is a data anomaly according to the number of data packets whose distribution state parameter exceeds the threshold.
- The comparison unit 83 includes (shown by the dotted line on the right): a distribution determination module 833 configured to determine the data distribution state of the plurality of data packets as a current distribution; and a distribution comparison module 834 configured to compare the current distribution with the historical distribution.
- The distribution comparison module 834 is configured to: determine the distribution center of the current distribution and acquire the distribution center of the historical distribution; and determine the offset between the distribution center of the current distribution and the distribution center of the historical distribution. Accordingly, the determining unit 84 is configured to determine that there is a data anomaly in response to the offset exceeding a predetermined offset threshold.
- The distribution comparison module 834 is configured to: determine the distribution state parameter of a randomly drawn data packet in the current distribution, that is, a first parameter; determine the distribution state parameter of the same data packet in the historical distribution, that is, a second parameter; and determine the difference between the first parameter and the second parameter. Accordingly, the determining unit 84 is configured to determine that there is a data anomaly in response to the difference exceeding a predetermined difference threshold.
- The historical distribution is a historical probability distribution obtained by processing the historical data with a Gaussian mixture model.
- The substitution module 831 is configured to substitute the plurality of data packets into the historical probability distribution to obtain the probability value of each data packet in the historical probability distribution, yielding a plurality of probability values.
- The threshold comparison module 832 is configured to compare the plurality of probability values with a predetermined probability threshold to determine the number of data packets whose probability value is less than the probability threshold, that is, the first number.
- The determining unit 84 is configured to determine that there is a data anomaly in response to the first number exceeding a predetermined count threshold, or the ratio of the first number to the number of the plurality of data packets exceeding a predetermined ratio threshold.
- The distribution determination module 833 is configured to process the plurality of data packets with a Gaussian mixture model to obtain a current probability distribution; the distribution comparison module 834 is configured to compare the current probability distribution with the historical probability distribution.
- The distribution comparison module 834 is configured to: determine a first peak position, the first peak position being the peak position of the curve corresponding to the current probability distribution; acquire a second peak position, the second peak position being the peak position of the curve corresponding to the historical probability distribution; and determine a position offset between the first peak position and the second peak position.
- The determining unit 84 is configured to determine that there is a data anomaly in response to the position offset exceeding a predetermined offset threshold.
- The distribution comparison module 834 is configured to: determine the occurrence probability of a randomly drawn data packet in the current probability distribution, that is, a first probability; determine the occurrence probability of the same data packet in the historical probability distribution, that is, a second probability; and determine the probability difference between the first probability and the second probability. Accordingly, the determining unit 84 is configured to determine that there is a data anomaly in response to the probability difference exceeding a predetermined difference threshold.
- The historical distribution is a historical cluster distribution obtained by applying a clustering algorithm to the historical data.
- The substitution module 831 is configured to: determine the position of each of the plurality of data packets in the distribution space of the historical cluster distribution, yielding a plurality of positions; and determine the distance between each of the plurality of positions and the corresponding cluster center position of the historical cluster distribution, yielding a plurality of distances.
- The threshold comparison module 832 is configured to compare the plurality of distances with a preset distance threshold and determine the number of data packets whose distance exceeds the preset distance threshold, that is, the second number.
- The determining unit 84 is configured to determine that there is a data anomaly in response to the second number exceeding a predetermined count threshold, or the ratio of the second number to the number of the plurality of data packets exceeding a predetermined ratio threshold.
- The distribution determination module 833 is configured to process the plurality of data packets with the clustering algorithm to obtain a current cluster distribution; the distribution comparison module 834 is configured to compare the current cluster distribution with the historical cluster distribution.
- The distribution comparison module 834 is configured to: determine a first center position, the first center position being the center position of the cluster corresponding to the current cluster distribution; acquire a second center position, the second center position being the center position of the cluster corresponding to the historical cluster distribution; and determine the distance between the first center position and the second center position. Accordingly, the determining unit 84 is configured to determine that there is a data anomaly in response to the distance exceeding a predetermined distance threshold.
- The distribution comparison module 834 is configured to: determine the distance between the position of a randomly drawn data packet in the current cluster distribution and the center of the cluster corresponding to the current cluster distribution, that is, a first distance; determine the distance between the position of the same data packet in the historical cluster distribution and the center of the cluster corresponding to the historical cluster distribution, that is, a second distance; and determine the distance difference between the first distance and the second distance. Accordingly, the determining unit 84 is configured to determine that there is a data anomaly in response to the distance difference exceeding a predetermined difference threshold.
- The currently acquired data packets can be compared with the historical distribution, which is derived from statistics on historical data, to determine whether a data anomaly exists, so that data anomalies are detected more effectively and can trigger early warning and intervention.
- A computer-readable storage medium having stored thereon a computer program that, when executed in a computer, causes the computer to perform the method described in connection with FIGS. 2 to 6.
- A computing device comprising a memory and a processor, wherein the memory stores executable code, and the processor, when executing the executable code, implements the method described in connection with FIGS. 2 to 6.
- The functions described herein can be implemented in hardware, software, firmware, or any combination thereof.
- When implemented in software, the functions may be stored in a computer-readable medium or transmitted as one or more instructions or code on a computer-readable medium.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computer Security & Cryptography (AREA)
- General Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Computing Systems (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Computer Hardware Design (AREA)
- Software Systems (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Mathematical Physics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Mathematical Analysis (AREA)
- Databases & Information Systems (AREA)
- Pure & Applied Mathematics (AREA)
- Mathematical Optimization (AREA)
- Algebra (AREA)
- Probability & Statistics with Applications (AREA)
- Medical Informatics (AREA)
- Evolutionary Biology (AREA)
- Computational Linguistics (AREA)
- Computational Mathematics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Bioethics (AREA)
- General Health & Medical Sciences (AREA)
- Data Exchanges In Wide-Area Networks (AREA)
Abstract
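A method and apparatus for determining a data anomaly. The method includes: acquiring a plurality of data packets having the same data structure within a predetermined time period (21); acquiring a historical distribution of historical data having the same data structure (22); comparing the plurality of data packets with the historical distribution to obtain a comparison result (23); and determining, according to the comparison result, whether a data anomaly exists (24). The apparatus corresponds to the method. With the method and apparatus, data anomalies in the currently acquired data packets can be identified effectively.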
Description
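One or more embodiments of this specification relate to the field of computer technology, and in particular to a method and apparatus for determining a data anomaly.
As the Internet evolves, data exchange between different platforms is becoming increasingly frequent. For example, when a user submits a loan request to a bank or a financial platform, the bank or financial platform sends the user's request data to a comprehensive computing platform for a full evaluation that decides whether to approve the loan request and for what amount. However, for security and to protect the confidentiality of their own users, platforms often apply privacy-protection processing to data before sending it to other platforms. Data processed in this way loses its business meaning. After receiving such data, the receiving platform can hardly use business rules to judge whether the data contains anomalies, for example whether the data was attacked or tampered with in transit, or whether the user population has shifted.
Even within a single computing platform, data processing often passes through multiple stages of a business chain. Anomalies can arise within a business stage or during transmission between stages, for example when the system is attacked or a model malfunctions.
On the other hand, in the big-data setting, data volume grows exponentially while business rules keep changing and can never be enumerated exhaustively. Detecting data anomalies through business rules alone therefore involves a huge workload and is still incomplete.
A more effective way of judging data anomalies and raising warnings is therefore needed.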
Summary of the Invention
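One or more embodiments of this specification describe a method and apparatus that can judge data anomalies and raise warnings without relying on the business meaning of the data.
According to a first aspect, a method for determining a data anomaly is provided, including: acquiring a plurality of data packets within a predetermined time period, the plurality of data packets having the same data structure; acquiring a historical distribution of historical data having the same data structure; comparing the plurality of data packets with the historical distribution; and determining, according to the comparison result, whether a data anomaly exists.
According to one implementation, comparing the plurality of data packets with the historical distribution includes: substituting the plurality of data packets into the historical distribution to obtain a plurality of distribution state parameters of the data packets in the historical distribution; and comparing the plurality of distribution state parameters with a predetermined threshold related to the distribution state to determine the number of data packets exceeding the threshold. Correspondingly, determining whether a data anomaly exists according to the comparison result includes: determining whether a data anomaly exists according to the number of data packets exceeding the threshold.
According to one implementation, comparing the plurality of data packets with the historical distribution includes: determining the data distribution state of the plurality of data packets as a current distribution; and comparing the current distribution with the historical distribution.
In one embodiment, comparing the current distribution with the historical distribution includes: determining the distribution center of the current distribution; acquiring the distribution center of the historical distribution; and determining the offset between the distribution center of the current distribution and the distribution center of the historical distribution. Correspondingly, determining whether a data anomaly exists according to the comparison result includes: in response to the offset exceeding a predetermined offset threshold, determining that a data anomaly exists.
In another embodiment, comparing the current distribution with the historical distribution includes: determining the distribution state parameter of a randomly drawn data packet in the current distribution, that is, a first parameter; determining the distribution state parameter of the same data packet in the historical distribution, that is, a second parameter; and determining the difference between the first parameter and the second parameter. Correspondingly, determining whether a data anomaly exists according to the comparison result includes: in response to the difference exceeding a predetermined difference threshold, determining that a data anomaly exists.
According to one implementation, the historical distribution is a historical probability distribution obtained by processing the historical data with a Gaussian mixture model. Correspondingly, the current distribution is a current probability distribution obtained by processing the current data packets with a Gaussian mixture model; the distribution state parameter can be a probability value; and the distribution center is the peak position of the probability distribution curve.
According to another implementation, the historical distribution is a historical cluster distribution obtained by applying a clustering algorithm to the historical data. Correspondingly, the current distribution is a current cluster distribution obtained by processing the current data packets with the same clustering algorithm; the distribution state parameter can be the distance between a data packet's position in the cluster distribution space and the corresponding cluster center; and the distribution center is the center position of the cluster distribution.
According to a second aspect, an apparatus for determining a data anomaly is provided, including: a data packet acquiring unit configured to acquire a plurality of data packets within a predetermined time period, the plurality of data packets having the same data structure; a history acquiring unit configured to acquire a historical distribution of historical data having the same data structure; a comparison unit configured to compare the plurality of data packets with the historical distribution; and a determining unit configured to determine, according to the comparison result, whether a data anomaly exists.
According to a third aspect, a computer-readable storage medium is provided, having stored thereon a computer program that, when executed in a computer, causes the computer to perform the method of the first aspect.
According to a fourth aspect, a computing device is provided, including a memory and a processor, wherein the memory stores executable code and the processor, when executing the executable code, implements the method of the first aspect.
With the method and apparatus provided in the embodiments of this specification, the currently acquired data packets are compared with a historical distribution derived from statistics on historical data, and whether the current data packets contain a data anomaly is determined from the comparison result. Data anomalies can thus be judged and warned of effectively without relying on the business meaning of the data.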
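To explain the technical solutions of the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of an embodiment disclosed in this specification;
FIG. 2 is a flowchart of a method according to one embodiment;
FIG. 3 is a flowchart of a comparison and determination process according to one embodiment;
FIG. 4 is a flowchart of a comparison and determination process according to another embodiment;
FIG. 5 is a schematic diagram of comparing peak positions;
FIG. 6 is a flowchart of a comparison and determination process according to yet another embodiment;
FIG. 7 is a schematic diagram of comparing probability values;
FIG. 8 is a schematic block diagram of a determining apparatus according to one embodiment.
The solutions provided in this specification are described below with reference to the drawings.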
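FIG. 1 is a schematic diagram of an embodiment disclosed in this specification. In FIG. 1, a computing platform, for example an Alipay server, acquires a plurality of data packets within a predetermined time period, for example credit request packets in which users apply for loans. These data packets share the same data structure, for example the same fields, whose contents may include the user's age, gender, income, loan amount, and so on. The computing platform also acquires the historical distribution of historical data having that same data structure; the historical data may consist of a large number of packets of the same kind received over an earlier, longer period. The computing platform then compares the plurality of data packets with the historical distribution and determines from the comparison result whether a data anomaly exists. If there is no anomaly, the platform can continue processing the data or send it to the next business stage. If an anomaly is found, a warning can be raised so that the relevant personnel analyze the cause and trigger an appropriate remedy. The specific process of determining a data anomaly is described below.
FIG. 2 is a flowchart of a method according to one embodiment. The method can be executed by any computing platform with computing and processing capability, such as a server. As shown in FIG. 2, the method includes: step 21, acquiring a plurality of data packets within a predetermined time period, the plurality of data packets having the same data structure; step 22, acquiring a historical distribution of historical data having the same data structure; step 23, comparing the plurality of data packets with the historical distribution; and step 24, determining from the comparison result whether a data anomaly exists. How each step is carried out is described below with specific examples.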
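First, in step 21, a plurality of data packets within a predetermined time period are acquired.
In one embodiment, the data packets are received from an external institution, for example users' credit request packets that an Alipay server receives from a bank or financial institution. The subsequent steps can then determine whether the packets sent by the external institution contain a data anomaly. Such an anomaly may be caused by an attack or tampering during transmission, or by a shift in the user population.
In another embodiment, the data packets are produced by a stage in the business chain of data processing. For example, credit request packets received from an external institution have to pass through several business stages, including field parsing, dimensionality reduction, model computation, and so on. Packets produced at any stage can be analyzed to determine whether that stage's packets contain a data anomaly. An anomaly at a given stage may result from a problem in data transmission between stages, or from a problem in that stage's system or model. Checking packets at intermediate stages therefore also helps determine whether the system or a model is faulty.
It can be understood that the data packets described above have the same data structure. More specifically, each packet can have the same fields. In one example, the packets are credit request packets, and the fields of each packet may include the user's age, gender, income, loan amount, and so on. In another example, the packets are user operation records used to analyze user behavior patterns, and the fields of each packet may include user ID, operation, operation target, operation time, and so on.
In one embodiment, the data packets acquired in step 21 are those within a predetermined time period before the current moment. The period can be set short so that the acquired packets are the most recently generated ones. For example, when the data volume is large, the period can be preset to 0.5 s or 1 s; when the data volume is relatively small, it can be preset to 30 s or 1 min. The number of packets acquired can also be chosen according to business needs. In one embodiment, all packets within the predetermined time period are acquired. In another embodiment, a predetermined number of packets within the period is acquired.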
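The packet selection of step 21 can be illustrated with a minimal Python sketch. It is only an illustration under assumptions: the dictionary layout, the `received_at` timestamp field, and the function name `recent_packets` are hypothetical and do not come from the specification.

```python
def recent_packets(packets, now, window_seconds, max_count=None):
    """Return the packets received within `window_seconds` before `now`.

    `packets` is a list of dicts sharing the same fields; the hypothetical
    `received_at` field holds the arrival time in seconds.
    """
    recent = [p for p in packets if 0 <= now - p["received_at"] <= window_seconds]
    recent.sort(key=lambda p: p["received_at"], reverse=True)  # newest first
    if max_count is not None:
        recent = recent[:max_count]  # keep only a predetermined number
    return recent


stream = [
    {"received_at": 100.0, "age": 31, "income": 5.2},
    {"received_at": 100.4, "age": 45, "income": 9.8},
    {"received_at": 100.9, "age": 28, "income": 4.1},
    {"received_at": 101.2, "age": 52, "income": 12.0},
]

# All packets from the last second, then at most two of them.
window = recent_packets(stream, now=101.3, window_seconds=1.0)
capped = recent_packets(stream, now=101.3, window_seconds=1.0, max_count=2)
```

The two calls correspond to the two embodiments above: taking every packet in the window, or taking a predetermined number of them.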
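With the plurality of data packets having the same structure acquired, step 22 acquires the historical distribution of historical data. It can be understood that the historical data has the same data structure as the packets acquired in step 21. In general, the historical data comprises a large number of packets of the same kind received over an earlier, longer period, for example a week or a month. In one embodiment, the historical data also has the same data source as the packets acquired in step 21. For example, when the packets in step 21 are credit request packets from an external institution, the historical data likewise consists of credit request packets received from the external institution over an earlier, longer period. When the packets in step 21 are produced by an intermediate business stage, the historical data consists of packets produced by the same stage over an earlier, longer period.
The historical distribution is obtained by applying statistics and computation to the historical data.
Next, in step 23, the currently acquired data packets are compared with the historical distribution; in step 24, whether the current packets contain a data anomaly is determined from the comparison result.
The comparison in step 23 can be implemented in several ways, and step 24 makes its determination from the comparison result produced by whichever way is used.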
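In one implementation, the comparison is performed by substitution. FIG. 3 is a flowchart of a comparison and determination process according to one embodiment. It can be understood that the flow in FIG. 3 consists of sub-steps of steps 23 and 24 in FIG. 2. As shown in FIG. 3, the flow includes the following steps. In step 31, the data packets are substituted into the historical distribution to obtain each packet's distribution state parameter in the historical distribution, yielding a plurality of distribution state parameters. In step 32, these parameters are compared with a predetermined threshold related to the distribution state, and the number of packets exceeding the threshold is determined. Then, in step 33, whether a data anomaly exists is determined from the number of packets exceeding the threshold.
This flow substitutes the currently acquired packets into the historical distribution and determines their distribution states within it. If a large number of abnormal states appear (more than a threshold), a data anomaly is considered to exist.
As noted above, the historical distribution is the data distribution state obtained by applying statistics and computation to the historical data. Different algorithms can be used to derive the historical distribution from the historical data. The execution of the flow in FIG. 3 for different forms of historical distribution is described below using the Gaussian mixture model and clustering algorithms as examples.
In one embodiment, the historical distribution is a historical probability distribution obtained by processing the historical data with a Gaussian mixture model (GMM). It can be understood that a GMM is a model that quantifies things precisely with Gaussian probability density functions, decomposing a thing into several components based on Gaussian probability density functions. A Gaussian probability density function usually takes the form of a normal distribution curve. As noted earlier, both the currently acquired packets and the historical data have the same data structure, which contains several fields. In one example, a Gaussian mixture model can be used to decompose the data with the data fields as dimensions, thereby determining a Gaussian mixture probability density function as the historical probability distribution.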
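When the historical distribution is a Gaussian probability distribution, the flow in FIG. 3 can be carried out as follows. In step 31, the currently acquired data packets are substituted into the historical probability distribution to obtain each packet's probability value in that distribution, yielding a plurality of probability values. In other words, when the historical distribution is a Gaussian probability distribution, the distribution state parameter is the distribution probability value. Assuming N packets have been acquired, step 31 determines their probability values in the historical probability distribution as p_1, p_2, …, p_N.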
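In step 32, the determined probability values are compared with a predetermined probability threshold, and the number of packets whose probability value is below the threshold, that is, the first number, is determined. In general, the predetermined probability threshold is a small probability value p_0, whose specific size can be set according to business needs. This step thus determines the number M1 (M1 <= N) of probability values among p_1, p_2, …, p_N that are below the threshold p_0. In other words, it determines how many of the current packets are "low-probability" packets.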
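Next, in step 33, whether a data anomaly exists is determined from the first number M1. More specifically, in one example, if M1 exceeds a predetermined count threshold M0, a data anomaly is determined to exist. That is, once more than M0 of the N currently acquired packets are "low-probability" packets, an anomaly is considered to exist. In another example, if the ratio M1/N of the first number to the number N of data packets exceeds a predetermined ratio threshold, a data anomaly is determined to exist. That is, once more than a predetermined proportion of the N packets are "low-probability" packets, a data anomaly is considered to exist.
In this way, the number or proportion of "low-probability" packets among the currently acquired packets is determined, and when it exceeds a certain threshold, a data anomaly can be considered to exist.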
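The substitution flow for a Gaussian mixture distribution can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the specification's implementation: it uses a one-dimensional mixture whose (weight, mean, standard deviation) components are assumed to have been fitted to historical data beforehand, and all names and numeric values are hypothetical.

```python
import math


def mixture_pdf(x, components):
    """Density of a 1-D Gaussian mixture given (weight, mean, std) triples."""
    total = 0.0
    for weight, mean, std in components:
        z = (x - mean) / std
        total += weight * math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))
    return total


def low_probability_anomaly(packets, components, p0, m0=None, ratio_threshold=None):
    """Steps 31 to 33: count packets whose historical density is below p0."""
    if not packets:
        return False, 0
    probabilities = [mixture_pdf(x, components) for x in packets]  # step 31
    m1 = sum(1 for p in probabilities if p < p0)                    # step 32
    if m0 is not None and m1 > m0:                                  # step 33, count rule
        return True, m1
    if ratio_threshold is not None and m1 / len(packets) > ratio_threshold:
        return True, m1                                             # step 33, ratio rule
    return False, m1


# Hypothetical historical mixture over a single field (for example, age).
historical = [(0.6, 30.0, 5.0), (0.4, 50.0, 8.0)]
normal_batch = [28.0, 31.5, 33.0, 47.0, 52.0]
shifted_batch = [90.0, 95.0, 29.0, 102.0, 88.0]

normal_result = low_probability_anomaly(normal_batch, historical, p0=1e-3, m0=2)
shifted_result = low_probability_anomaly(shifted_batch, historical, p0=1e-3, m0=2)
```

In practice the packets have several fields, so the density would be multivariate; the single-field case is used here only to keep the sketch short.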
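The following describes how FIG. 3 is carried out when the historical distribution is a cluster distribution. It can be understood that a clustering algorithm performs similarity analysis and data statistics on multiple samples. For packets with the same data structure, the fields of the packet can serve as dimensions so that each packet is mapped to a point in a multidimensional space, and a clustering algorithm then partitions and clusters the packets. The clustering can use a hierarchy-based algorithm such as BIRCH or CURE, a density-based algorithm such as DBSCAN or OPTICS, or any other suitable clustering algorithm.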
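When the historical distribution is a cluster distribution, the flow in FIG. 3 can be carried out as follows. In step 31, the position of each currently acquired packet in the distribution space of the historical cluster distribution is determined, yielding a plurality of positions, and the distance between each position and the corresponding cluster center position of the historical cluster distribution is determined, yielding a plurality of distances. As noted earlier, each packet can be mapped to a point in the multidimensional space whose dimensions are the fields; that multidimensional space is the distribution space of the historical cluster distribution. This mapping gives the positions P_1, P_2, …, P_N of the N currently acquired packets in the space of the historical cluster distribution. The positions of the cluster centers in the historical cluster distribution can then be determined, followed by the distances D_1, D_2, …, D_N between the positions P_1, P_2, …, P_N and the corresponding cluster center positions. It can be understood that these are distances between multidimensional vectors in the multidimensional space.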
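Next, in step 32, the distances are compared with a preset distance threshold, and the number of distances exceeding the threshold, that is, the second number, is determined. In general, a cluster center lies in the region where the data is densest; the farther a packet is from the cluster center, the lower its frequency or probability of occurrence. The distance threshold can therefore be set to a relatively large distance D_0, whose specific size can be set according to business needs. This step thus determines the number M2 (M2 <= N) of distances among D_1, D_2, …, D_N that are greater than D_0. In other words, it determines how many of the currently acquired packets lie far from the cluster center.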
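Then, in step 33, whether a data anomaly exists is determined from the second number. Specifically, in one example, if the second number M2 exceeds a predetermined count threshold M0, a data anomaly is considered to exist. That is, once more than M0 of the N currently acquired packets lie far from the cluster center, a data anomaly is considered to exist. In another example, if the ratio M2/N of the second number to the number N of data packets exceeds a predetermined ratio threshold, a data anomaly is determined to exist. That is, once more than a predetermined proportion of the N currently acquired packets lie far from the cluster center, a data anomaly is considered to exist.
In this way, the number or proportion of currently acquired packets that lie far from the cluster center is determined, and when it exceeds a certain threshold, a data anomaly can be considered to exist.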
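A minimal Python sketch of the substitution flow for a cluster distribution follows. It rests on assumptions that the specification does not fix: the historical cluster centers are taken as given, the "corresponding" center of a packet is taken to be the nearest one, and distance is Euclidean. All names and values are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def far_from_center_anomaly(packets, centers, d0, m0=None, ratio_threshold=None):
    """Steps 31 to 33: count packets farther than d0 from their nearest center."""
    if not packets:
        return False, 0
    # Step 31: distance of each packet to its (assumed nearest) historical center.
    distances = [min(euclidean(p, c) for c in centers) for p in packets]
    # Step 32: the second number M2.
    m2 = sum(1 for d in distances if d > d0)
    # Step 33: count rule, then ratio rule.
    if m0 is not None and m2 > m0:
        return True, m2
    if ratio_threshold is not None and m2 / len(packets) > ratio_threshold:
        return True, m2
    return False, m2


# Hypothetical historical cluster centers over two fields (age, income).
centers = [(30.0, 5.0), (50.0, 12.0)]
batch = [(31.0, 5.5), (49.0, 11.0), (80.0, 40.0), (29.0, 4.0), (85.0, 45.0)]

by_count = far_from_center_anomaly(batch, centers, d0=10.0, m0=1)
by_ratio = far_from_center_anomaly(batch, centers, d0=10.0, ratio_threshold=0.5)
```

Because the fields usually have different scales, a real deployment would likely normalize them before measuring distance; that step is omitted here.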
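The implementation in which the current packets are substituted into the historical distribution for comparison has been described above using the Gaussian mixture distribution and the cluster distribution as examples. When many packets are currently acquired, for example more than a certain number (such as 200), the distribution state of the currently acquired packets can additionally be determined as the current distribution and compared with the historical distribution to determine a data anomaly.
Specifically, FIG. 4 is a flowchart of a comparison and determination process according to another embodiment. It can be understood that the flow in FIG. 4 consists of sub-steps of steps 23 and 24 in FIG. 2. As shown in FIG. 4, in step 41 the data distribution state of the currently acquired packets is determined as the current distribution; in step 42 the distribution center of the current distribution is determined; in step 43 the distribution center of the historical distribution is acquired; in step 44 the offset between the two distribution centers is determined; in step 45 it is judged whether the offset exceeds a predetermined offset threshold; and if it does, step 46 determines that a data anomaly exists.
This flow compares the distribution center of the current distribution with that of the historical distribution; when the offset between them exceeds a predetermined threshold, a data anomaly is considered to exist. The execution of the flow in FIG. 4 for different forms of historical distribution is again described below using the Gaussian mixture model and clustering algorithms as examples.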
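In one embodiment, the historical distribution is a historical probability distribution obtained by processing the historical data with a Gaussian mixture model. Correspondingly, in step 41 the currently acquired packets are processed with the same Gaussian mixture model, and the resulting Gaussian mixture distribution of the current packets serves as the current probability distribution. Next, in step 42, a first peak position P1 is determined, namely the peak position of the curve corresponding to the current probability distribution. In step 43, a second peak position P2 is acquired, namely the peak position of the curve corresponding to the historical probability distribution. In other words, for a Gaussian probability distribution, the distribution center is the peak position of the probability distribution curve.
Then, in step 44, the position offset ΔP between the first peak position P1 and the second peak position P2 is determined. In step 45, it is judged whether ΔP exceeds a predetermined offset threshold; if it does, a data anomaly is determined to exist.
FIG. 5 is a schematic diagram of the above comparison of peak positions. In FIG. 5, V1 is a curve schematically showing the current probability distribution, and V2 is a curve schematically showing the historical probability distribution. The peak position of curve V1 is shown as P1 and that of curve V2 as P2. If the offset ΔP between P1 and P2 is greater than the preset threshold, the current probability distribution can be considered to have changed substantially from the historical probability distribution, so a data anomaly may currently exist.
It can be understood that because the packets may contain more fields, the Gaussian mixture model may compute statistics over more dimensions, and the actual probability distribution graph is multidimensional. FIG. 5 is only a simplified illustration.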
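The peak comparison of FIG. 5 can be sketched in Python for the simplified one-dimensional case. The sketch assumes both mixtures are already fitted and locates each peak by a plain grid search, which is a choice made here for illustration and not something the specification prescribes; the names and values are hypothetical.

```python
import math


def mixture_pdf(x, components):
    """Density of a 1-D Gaussian mixture given (weight, mean, std) triples."""
    total = 0.0
    for weight, mean, std in components:
        z = (x - mean) / std
        total += weight * math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))
    return total


def peak_position(components, lo, hi, steps=2000):
    """Grid search for the x in [lo, hi] where the mixture density is highest."""
    best_x, best_p = lo, mixture_pdf(lo, components)
    for i in range(1, steps + 1):
        x = lo + (hi - lo) * i / steps
        p = mixture_pdf(x, components)
        if p > best_p:
            best_x, best_p = x, p
    return best_x


def peak_shift_anomaly(current, historical, lo, hi, offset_threshold):
    """Steps 42 to 46 of FIG. 4 for 1-D probability curves."""
    p1 = peak_position(current, lo, hi)        # step 42: first peak position P1
    p2 = peak_position(historical, lo, hi)     # step 43: second peak position P2
    delta = abs(p1 - p2)                       # step 44: position offset
    return delta > offset_threshold, delta     # steps 45 and 46


historical = [(1.0, 30.0, 5.0)]   # stands in for curve V2
current = [(1.0, 42.0, 5.0)]      # stands in for curve V1
flagged, delta = peak_shift_anomaly(current, historical, 0.0, 100.0, offset_threshold=5.0)
```

For multidimensional mixtures a grid search becomes impractical; comparing the means of the dominant components would be one possible substitute, again as an assumption rather than the described method.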
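The above describes how the comparison of distribution centers in FIG. 4 is carried out for a Gaussian distribution. When the historical distribution is a cluster distribution obtained with a clustering algorithm, the flow in FIG. 4 can be carried out similarly. Specifically, in step 41 the same clustering algorithm is used to determine the cluster distribution of the currently acquired packets as the current cluster distribution. In step 42, a first center position is determined, namely the center position of the clusters corresponding to the current cluster distribution; in step 43, a second center position is acquired, namely the center position of the clusters corresponding to the historical cluster distribution. In other words, for a cluster distribution, the distribution center is the center position of the cluster distribution. It can be understood that although different clustering algorithms may group the data into several different clusters, the center of the entire cluster distribution can always be determined by clustering again or by computing a centroid. The first and second center positions are determined in this way.
Next, in step 44, the distance D between the first center position and the second center position is determined. It can be understood that this can be the distance between multidimensional points or vectors in a multidimensional space.
Then, in step 45, it is judged whether the distance D exceeds a predetermined distance threshold. If it does, a data anomaly is determined to exist.
In this approach, the center position of the current cluster distribution is compared with that of the historical cluster distribution; if the center has shifted substantially, a data anomaly is considered to exist.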
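The center comparison for cluster distributions can be sketched as below. The sketch takes the centroid route mentioned above: the overall center is computed as the size-weighted centroid of the cluster centers. The input layout (a list of (center, size) pairs), the Euclidean distance, and every name and value are assumptions made for illustration.

```python
import math


def overall_center(clusters):
    """Size-weighted centroid of a cluster distribution.

    `clusters` is a list of (center, size) pairs, where center is a tuple of
    coordinates and size is the number of packets in that cluster.
    """
    total = sum(size for _, size in clusters)
    dims = len(clusters[0][0])
    return tuple(
        sum(center[d] * size for center, size in clusters) / total
        for d in range(dims)
    )


def center_shift_anomaly(current_clusters, historical_clusters, distance_threshold):
    """Steps 42 to 46 of FIG. 4 for cluster distributions."""
    c1 = overall_center(current_clusters)       # step 42: first center position
    c2 = overall_center(historical_clusters)    # step 43: second center position
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(c1, c2)))  # step 44
    return d > distance_threshold, d            # steps 45 and 46


# Hypothetical clusters over two fields (age, income).
historical_clusters = [((30.0, 5.0), 600), ((50.0, 12.0), 400)]
current_clusters = [((60.0, 20.0), 150), ((80.0, 30.0), 50)]

flagged, distance = center_shift_anomaly(
    current_clusters, historical_clusters, distance_threshold=5.0)
```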
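Besides comparing distribution centers, a sample can be drawn at random and the state parameters of that same sample in the different distributions compared to determine whether a data anomaly exists. FIG. 6 is a flowchart of a comparison and determination process according to yet another embodiment. It can be understood that the flow in FIG. 6 consists of sub-steps of steps 23 and 24 in FIG. 2. As shown in FIG. 6, in step 61 the data distribution state of the currently acquired packets is determined as the current distribution; in step 62 the distribution state parameter of a randomly drawn packet in the current distribution, that is, the first parameter, is determined; in step 63 the distribution state parameter of the same packet in the historical distribution, that is, the second parameter, is determined; in step 64 the difference between the first and second parameters is determined; then, in step 65, it is judged whether the difference exceeds a predetermined difference threshold; and if it does, step 66 determines that a data anomaly exists.
This flow draws a sample at random and compares the sample's state parameters in the current distribution and in the historical distribution; when the difference between the two exceeds a predetermined threshold, a data anomaly is considered to exist. The execution of the flow in FIG. 6 for different forms of historical distribution is again described below using the Gaussian mixture model and clustering algorithms as examples.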
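In one embodiment, the historical distribution is a historical probability distribution obtained by processing the historical data with a Gaussian mixture model. Correspondingly, the flow in FIG. 6 can be carried out as follows. In step 61, a Gaussian mixture model is used to determine the Gaussian mixture probability distribution of the currently acquired packets as the current probability distribution. In step 62, the occurrence probability of a randomly drawn packet in the current probability distribution, that is, the first probability p1, is determined; in step 63, the occurrence probability of the same packet in the historical probability distribution, that is, the second probability p2, is determined. Then, in step 64, the probability difference Δp between p1 and p2 is determined. In step 65, it is judged whether Δp exceeds a predetermined difference threshold; if it does, a data anomaly is determined to exist.
FIG. 7 is a schematic diagram of the above comparison of probability values. In FIG. 7, V1 is a curve schematically showing the current probability distribution, and V2 is a curve schematically showing the historical probability distribution. The probability of the randomly drawn packet R on the current probability distribution curve V1 is p1, and its probability on the historical probability distribution curve V2 is p2. If the difference Δp between p1 and p2 is greater than the preset threshold, the current probability distribution can be considered to have changed substantially from the historical probability distribution, so an anomaly may currently exist.
It can be understood that a Gaussian mixture model generally computes statistics over multiple dimensions, so the actual distribution graph is multidimensional. FIG. 7 is only a simplified illustration.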
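The probability comparison of FIG. 7 can be sketched in Python for the one-dimensional case. The sketch assumes that both the current and the historical mixtures have already been fitted, and it passes in a random generator so that the draw can be reproduced; the names, the seeded generator, and the numeric values are hypothetical.

```python
import math
import random


def mixture_pdf(x, components):
    """Density of a 1-D Gaussian mixture given (weight, mean, std) triples."""
    total = 0.0
    for weight, mean, std in components:
        z = (x - mean) / std
        total += weight * math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))
    return total


def sample_probability_anomaly(packets, current, historical, diff_threshold, rng=random):
    """Steps 62 to 66 of FIG. 6 for one randomly drawn packet."""
    sample = rng.choice(packets)               # randomly drawn packet R
    p1 = mixture_pdf(sample, current)          # step 62: first probability
    p2 = mixture_pdf(sample, historical)       # step 63: second probability
    delta_p = abs(p1 - p2)                     # step 64: probability difference
    return delta_p > diff_threshold, delta_p   # steps 65 and 66


historical = [(1.0, 30.0, 5.0)]   # stands in for curve V2
current = [(1.0, 80.0, 5.0)]      # stands in for curve V1, fitted to the batch
batch = [78.0, 80.0, 82.0]

flagged, delta_p = sample_probability_anomaly(
    batch, current, historical, diff_threshold=0.01, rng=random.Random(0))
```

A single draw is a noisy signal, so repeating the draw several times and aggregating the differences may be more robust; that variation is a suggestion here, not part of the described flow.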
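The above describes how the comparison of the same sample's state parameters in FIG. 6 is carried out for a Gaussian distribution. When the historical distribution is a cluster distribution obtained with a clustering algorithm, the flow in FIG. 6 can be carried out similarly. Specifically, in step 61 the same clustering algorithm is used to determine the cluster distribution of the currently acquired packets as the current cluster distribution. In step 62, the distance between the position of a randomly drawn packet in the current cluster distribution and the center of the cluster corresponding to the current cluster distribution, that is, the first distance D1, is determined; in step 63, the distance between the position of the same packet in the historical cluster distribution and the center of the cluster corresponding to the historical cluster distribution, that is, the second distance D2, is determined. It can be understood that the randomly drawn packet can be mapped to a point in the cluster distribution space; D1 and D2 are the distances between that point and the center of the corresponding cluster distribution, and can be obtained with any method for computing the distance between multidimensional vectors in a multidimensional space.
In step 64, the distance difference ΔD between the first distance D1 and the second distance D2 is determined. In step 65, it is judged whether ΔD exceeds a predetermined difference threshold; if it does, a data anomaly is determined to exist.
In this approach, the same sample's distance from the center in the current cluster distribution is compared with its distance from the center in the historical cluster distribution. If the two differ substantially, the current cluster distribution is considered to differ substantially from the historical cluster distribution, and an anomaly is determined to exist.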
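The distance comparison for one randomly drawn packet can be sketched in the same style. As before, this is only a sketch under assumptions: cluster centers are given, the corresponding center is taken to be the nearest one, distance is Euclidean, and all names and values are hypothetical.

```python
import math
import random


def distance_to_nearest_center(point, centers):
    """Euclidean distance from a point to the closest of the given centers."""
    return min(
        math.sqrt(sum((a - b) ** 2 for a, b in zip(point, center)))
        for center in centers
    )


def sample_distance_anomaly(packets, current_centers, historical_centers,
                            diff_threshold, rng=random):
    """Steps 62 to 66 of FIG. 6 for cluster distributions."""
    sample = rng.choice(packets)                                  # random packet
    d1 = distance_to_nearest_center(sample, current_centers)      # step 62: D1
    d2 = distance_to_nearest_center(sample, historical_centers)   # step 63: D2
    delta_d = abs(d1 - d2)                                        # step 64
    return delta_d > diff_threshold, delta_d                      # steps 65 and 66


historical_centers = [(30.0, 5.0), (50.0, 12.0)]
current_centers = [(80.0, 40.0)]
batch = [(79.0, 39.0), (81.0, 41.0), (80.0, 42.0)]

flagged, delta_d = sample_distance_anomaly(
    batch, current_centers, historical_centers,
    diff_threshold=10.0, rng=random.Random(0))
```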
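Although the description above uses the Gaussian mixture model and cluster distributions, the historical distribution is not limited to these two examples; other algorithms can be used as long as they can derive the distribution pattern of the historical data through statistical analysis. In addition, although the comparison of the currently acquired packets with the historical distribution has been described with reference to the specific flows of FIG. 3, FIG. 4, and FIG. 6, the manner of comparison is not limited to these specific examples. Depending on the specific form of the historical distribution, other distribution state parameters can also be compared.
With the method described in the above embodiments, even if the currently acquired packets have lost their business meaning through privacy protection, the distribution pattern of the historical data can be derived statistically and the currently acquired packets compared with the historical distribution to determine whether a data anomaly currently exists. Data anomalies are thereby discovered more effectively, enabling early warning and intervention.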
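According to an embodiment of another aspect, an apparatus for determining a data anomaly is also provided. FIG. 8 is a schematic block diagram of a determining apparatus according to one embodiment. As shown in FIG. 8, the determining apparatus 800 includes: a data packet acquiring unit 81 configured to acquire a plurality of data packets within a predetermined time period, the plurality of data packets having the same data structure; a history acquiring unit 82 configured to acquire a historical distribution of historical data having the same data structure; a comparison unit 83 configured to compare the plurality of data packets with the historical distribution; and a determining unit 84 configured to determine from the comparison result whether a data anomaly exists.
In one embodiment, the comparison unit 83 includes (shown by the dashed box on the left): a substitution module 831 configured to substitute the plurality of data packets into the historical distribution to obtain each packet's distribution state parameter in the historical distribution, yielding a plurality of distribution state parameters; and a threshold comparison module 832 configured to compare the plurality of distribution state parameters with a predetermined threshold related to the distribution state and determine the number of packets whose distribution state parameter exceeds the threshold. Correspondingly, the determining unit 84 is configured to determine whether a data anomaly exists according to the number of packets whose distribution state parameter exceeds the threshold.
In another embodiment, the comparison unit 83 includes (shown by the dashed box on the right): a distribution determination module 833 configured to determine the data distribution state of the plurality of data packets as a current distribution; and a distribution comparison module 834 configured to compare the current distribution with the historical distribution.
In one embodiment, the distribution comparison module 834 is configured to: determine the distribution center of the current distribution and acquire the distribution center of the historical distribution; and determine the offset between the two distribution centers. Correspondingly, the determining unit 84 is configured to determine, in response to the offset exceeding a predetermined offset threshold, that a data anomaly exists.
In one embodiment, the distribution comparison module 834 is configured to: determine the distribution state parameter of a randomly drawn packet in the current distribution, that is, a first parameter; determine the distribution state parameter of the same packet in the historical distribution, that is, a second parameter; and determine the difference between the first parameter and the second parameter. Correspondingly, the determining unit 84 is configured to determine, in response to the difference exceeding a predetermined difference threshold, that a data anomaly exists.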
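In one embodiment, the historical distribution is a historical probability distribution obtained by processing the historical data with a Gaussian mixture model. Correspondingly, the substitution module 831 is configured to substitute the plurality of data packets into the historical probability distribution to obtain each packet's probability value in that distribution, yielding a plurality of probability values; the threshold comparison module 832 is configured to compare the plurality of probability values with a predetermined probability threshold and determine the number of packets whose probability value is below the threshold, that is, the first number. The determining unit 84 is configured to determine that a data anomaly exists in response to the first number exceeding a predetermined count threshold, or the ratio of the first number to the number of data packets exceeding a predetermined ratio threshold.
In one embodiment, when the historical distribution is a Gaussian probability distribution, the distribution determination module 833 is configured to process the plurality of data packets with a Gaussian mixture model to obtain a current probability distribution, and the distribution comparison module 834 is configured to compare the current probability distribution with the historical probability distribution.
More specifically, in one example, the distribution comparison module 834 is configured to: determine a first peak position, namely the peak position of the curve corresponding to the current probability distribution; acquire a second peak position, namely the peak position of the curve corresponding to the historical probability distribution; and determine the position offset between the first and second peak positions. Correspondingly, the determining unit 84 is configured to determine, in response to the position offset exceeding a predetermined offset threshold, that a data anomaly exists.
In another example, the distribution comparison module 834 is configured to: determine the occurrence probability of a randomly drawn packet in the current probability distribution, that is, a first probability; determine the occurrence probability of the same packet in the historical probability distribution, that is, a second probability; and determine the probability difference between the first and second probabilities. Correspondingly, the determining unit 84 is configured to determine, in response to the probability difference exceeding a predetermined difference threshold, that a data anomaly exists.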
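In one embodiment, the historical distribution is a historical cluster distribution obtained by applying a clustering algorithm to the historical data. Correspondingly, the substitution module 831 is configured to: determine the position of each of the plurality of data packets in the distribution space of the historical cluster distribution, yielding a plurality of positions; and determine the distance between each of the positions and the corresponding cluster center position of the historical cluster distribution, yielding a plurality of distances. The threshold comparison module 832 is configured to compare the plurality of distances with a preset distance threshold and determine the number of packets whose distance exceeds the threshold, that is, the second number. Correspondingly, the determining unit 84 is configured to determine that a data anomaly exists in response to the second number exceeding a predetermined count threshold, or the ratio of the second number to the number of data packets exceeding a predetermined ratio threshold.
In one embodiment, when the historical distribution is a historical cluster distribution, the distribution determination module 833 is configured to process the plurality of data packets with the clustering algorithm to obtain a current cluster distribution, and the distribution comparison module 834 is configured to compare the current cluster distribution with the historical cluster distribution.
More specifically, in one example, the distribution comparison module 834 is configured to: determine a first center position, namely the center position of the clusters corresponding to the current cluster distribution; acquire a second center position, namely the center position of the clusters corresponding to the historical cluster distribution; and determine the distance between the first and second center positions. Correspondingly, the determining unit 84 is configured to determine, in response to the distance exceeding a predetermined distance threshold, that a data anomaly exists.
In another example, the distribution comparison module 834 is configured to: determine the distance between the position of a randomly drawn packet in the current cluster distribution and the center of the cluster corresponding to the current cluster distribution, that is, a first distance; determine the distance between the position of the same packet in the historical cluster distribution and the center of the cluster corresponding to the historical cluster distribution, that is, a second distance; and determine the distance difference between the first and second distances. Correspondingly, the determining unit 84 is configured to determine, in response to the distance difference exceeding a predetermined difference threshold, that a data anomaly exists.
With the apparatus of the above embodiments, even if the acquired packets have lost their business meaning through privacy protection, the distribution pattern of the historical data can be derived statistically and the currently acquired packets compared with the historical distribution to determine whether a data anomaly currently exists. Data anomalies are thereby discovered more effectively, enabling early warning and intervention.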
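According to an embodiment of another aspect, a computer-readable storage medium is also provided, having stored thereon a computer program that, when executed in a computer, causes the computer to perform the method described with reference to FIG. 2 to FIG. 6.
According to an embodiment of yet another aspect, a computing device is also provided, including a memory and a processor, wherein the memory stores executable code and the processor, when executing the executable code, implements the method described with reference to FIG. 2 to FIG. 6.
A person skilled in the art will appreciate that, in one or more of the above examples, the functions described in the present invention can be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions can be stored in a computer-readable medium or transmitted as one or more instructions or code on a computer-readable medium.
The specific implementations described above further explain the objectives, technical solutions, and beneficial effects of the present invention in detail. It should be understood that they are merely specific implementations of the present invention and are not intended to limit its scope of protection; any modification, equivalent replacement, or improvement made on the basis of the technical solutions of the present invention shall fall within its scope of protection.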
Claims (28)
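- A method for determining a data anomaly, comprising: acquiring a plurality of data packets within a predetermined time period, the plurality of data packets having the same data structure; acquiring a historical distribution of historical data having the same data structure; comparing the plurality of data packets with the historical distribution to obtain a comparison result; and determining, according to the comparison result, whether a data anomaly exists.
- The method according to claim 1, wherein comparing the plurality of data packets with the historical distribution comprises: substituting the plurality of data packets into the historical distribution to obtain the distribution state parameter of each of the data packets in the historical distribution, yielding a plurality of distribution state parameters; and comparing the plurality of distribution state parameters with a predetermined threshold related to the distribution state to determine the number of data packets whose distribution state parameter exceeds the threshold; and wherein determining, according to the comparison result, whether a data anomaly exists comprises: determining whether a data anomaly exists according to the number of data packets whose distribution state parameter exceeds the threshold.
- The method according to claim 1, wherein comparing the plurality of data packets with the historical distribution comprises: determining the data distribution state of the plurality of data packets as a current distribution; and comparing the current distribution with the historical distribution.
- The method according to claim 3, wherein comparing the current distribution with the historical distribution comprises: determining a distribution center of the current distribution; acquiring a distribution center of the historical distribution; and determining an offset between the distribution center of the current distribution and the distribution center of the historical distribution; and wherein determining, according to the comparison result, whether a data anomaly exists comprises: in response to the offset exceeding a predetermined offset threshold, determining that a data anomaly exists.
- The method according to claim 3, wherein comparing the current distribution with the historical distribution comprises: determining the distribution state parameter of a randomly drawn data packet in the current distribution, that is, a first parameter; determining the distribution state parameter of the same data packet in the historical distribution, that is, a second parameter; and determining the difference between the first parameter and the second parameter; and wherein determining, according to the comparison result, whether a data anomaly exists comprises: in response to the difference exceeding a predetermined difference threshold, determining that a data anomaly exists.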
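- The method according to claim 2, wherein the historical distribution is a historical probability distribution obtained by processing the historical data with a Gaussian mixture model; comparing the plurality of data packets with the historical distribution comprises: substituting the plurality of data packets into the historical probability distribution to obtain the probability value of each of the data packets in the historical probability distribution, yielding a plurality of probability values; and comparing the plurality of probability values with a predetermined probability threshold to determine the number of data packets whose probability value is less than the probability threshold, that is, a first number; and determining, according to the comparison result, whether a data anomaly exists comprises: in response to the first number exceeding a predetermined count threshold, or the ratio of the first number to the number of the plurality of data packets exceeding a predetermined ratio threshold, determining that a data anomaly exists.
- The method according to claim 3, wherein the historical distribution is a historical probability distribution obtained by processing the historical data with a Gaussian mixture model; determining the data distribution state of the plurality of data packets as the current distribution comprises: processing the plurality of data packets with a Gaussian mixture model to obtain a current probability distribution; and comparing the current distribution with the historical distribution comprises: comparing the current probability distribution with the historical probability distribution.
- The method according to claim 7, wherein comparing the current probability distribution with the historical probability distribution comprises: determining a first peak position, the first peak position being the peak position of the curve corresponding to the current probability distribution; acquiring a second peak position, the second peak position being the peak position of the curve corresponding to the historical probability distribution; and determining a position offset between the first peak position and the second peak position; and wherein determining, according to the comparison result, whether a data anomaly exists comprises: in response to the position offset exceeding a predetermined offset threshold, determining that a data anomaly exists.
- The method according to claim 7, wherein comparing the current probability distribution with the historical probability distribution comprises: determining the occurrence probability of a randomly drawn data packet in the current probability distribution, that is, a first probability; determining the occurrence probability of the same data packet in the historical probability distribution, that is, a second probability; and determining the probability difference between the first probability and the second probability; and wherein determining, according to the comparison result, whether a data anomaly exists comprises: in response to the probability difference exceeding a predetermined difference threshold, determining that a data anomaly exists.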
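- The method according to claim 2, wherein the historical distribution is a historical cluster distribution obtained by applying a clustering algorithm to the historical data; comparing the plurality of data packets with the historical distribution comprises: determining the position of each of the plurality of data packets in the distribution space of the historical cluster distribution, yielding a plurality of positions; determining the distances between the plurality of positions and the corresponding cluster center positions of the historical cluster distribution, yielding a plurality of distances; and comparing the plurality of distances with a preset distance threshold to determine the number of distances exceeding the preset distance threshold, that is, a second number; and determining, according to the comparison result, whether a data anomaly exists comprises: in response to the second number exceeding a predetermined count threshold, or the ratio of the second number to the number of the plurality of data packets exceeding a predetermined ratio threshold, determining that a data anomaly exists.
- The method according to claim 3, wherein the historical distribution is a historical cluster distribution obtained by applying a clustering algorithm to the historical data; determining the data distribution state of the plurality of data packets as the current distribution comprises: processing the plurality of data packets with the clustering algorithm to obtain a current cluster distribution; and comparing the current distribution with the historical distribution comprises: comparing the current cluster distribution with the historical cluster distribution.
- The method according to claim 11, wherein comparing the current cluster distribution with the historical cluster distribution comprises: determining a first center position, the first center position being the center position of the cluster corresponding to the current cluster distribution; acquiring a second center position, the second center position being the center position of the cluster corresponding to the historical cluster distribution; and determining the distance between the first center position and the second center position; and wherein determining, according to the comparison result, whether a data anomaly exists comprises: in response to the distance exceeding a predetermined distance threshold, determining that a data anomaly exists.
- The method according to claim 11, wherein comparing the current cluster distribution with the historical cluster distribution comprises: determining the distance between the position of a randomly drawn data packet in the current cluster distribution and the center of the cluster corresponding to the current cluster distribution, that is, a first distance; determining the distance between the position of the same data packet in the historical cluster distribution and the center of the cluster corresponding to the historical cluster distribution, that is, a second distance; and determining the distance difference between the first distance and the second distance; and wherein determining, according to the comparison result, whether a data anomaly exists comprises: in response to the distance difference exceeding a predetermined difference threshold, determining that a data anomaly exists.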
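- An apparatus for determining a data anomaly, comprising: a data packet acquiring unit configured to acquire a plurality of data packets within a predetermined time period, the plurality of data packets having the same data structure; a history acquiring unit configured to acquire a historical distribution of historical data having the same data structure; a comparison unit configured to compare the plurality of data packets with the historical distribution to obtain a comparison result; and a determining unit configured to determine, according to the comparison result, whether a data anomaly exists.
- The apparatus according to claim 14, wherein the comparison unit comprises: a substitution module configured to substitute the plurality of data packets into the historical distribution to obtain the distribution state parameters of the plurality of data packets in the historical distribution, yielding a plurality of distribution state parameters; and a threshold comparison module configured to compare the plurality of distribution state parameters with a predetermined threshold related to the distribution state and determine the number of data packets whose distribution state parameter exceeds the threshold; and wherein the determining unit is configured to determine whether a data anomaly exists according to the number of data packets whose distribution state parameter exceeds the threshold.
- The apparatus according to claim 14, wherein the comparison unit comprises: a distribution determination module configured to determine the data distribution state of the plurality of data packets as a current distribution; and a distribution comparison module configured to compare the current distribution with the historical distribution.
- The apparatus according to claim 16, wherein the distribution comparison module is configured to: determine a distribution center of the current distribution and acquire a distribution center of the historical distribution; and determine an offset between the distribution center of the current distribution and the distribution center of the historical distribution; and wherein the determining unit is configured to determine, in response to the offset exceeding a predetermined offset threshold, that a data anomaly exists.
- The apparatus according to claim 16, wherein the distribution comparison module is configured to: determine the distribution state parameter of a randomly drawn data packet in the current distribution, that is, a first parameter; determine the distribution state parameter of the same data packet in the historical distribution, that is, a second parameter; and determine the difference between the first parameter and the second parameter; and wherein the determining unit is configured to determine, in response to the difference exceeding a predetermined difference threshold, that a data anomaly exists.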
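- The apparatus according to claim 15, wherein the historical distribution is a historical probability distribution obtained by processing the historical data with a Gaussian mixture model; the substitution module is configured to substitute the plurality of data packets into the historical probability distribution to obtain the probability value of each of the data packets in the historical probability distribution, yielding a plurality of probability values; the threshold comparison module is configured to compare the plurality of probability values with a predetermined probability threshold and determine the number of data packets whose probability value is less than the probability threshold, that is, a first number; and the determining unit is configured to determine that a data anomaly exists in response to the first number exceeding a predetermined count threshold, or the ratio of the first number to the number of the plurality of data packets exceeding a predetermined ratio threshold.
- The apparatus according to claim 16, wherein the historical distribution is a historical probability distribution obtained by processing the historical data with a Gaussian mixture model; the distribution determination module is configured to process the plurality of data packets with a Gaussian mixture model to obtain a current probability distribution; and the distribution comparison module is configured to compare the current probability distribution with the historical probability distribution.
- The apparatus according to claim 20, wherein the distribution comparison module is configured to: determine a first peak position, the first peak position being the peak position of the curve corresponding to the current probability distribution; acquire a second peak position, the second peak position being the peak position of the curve corresponding to the historical probability distribution; and determine a position offset between the first peak position and the second peak position; and wherein the determining unit is configured to determine, in response to the position offset exceeding a predetermined offset threshold, that a data anomaly exists.
- The apparatus according to claim 20, wherein the distribution comparison module is configured to: determine the occurrence probability of a randomly drawn data packet in the current probability distribution, that is, a first probability; determine the occurrence probability of the same data packet in the historical probability distribution, that is, a second probability; and determine the probability difference between the first probability and the second probability; and wherein the determining unit is configured to determine, in response to the probability difference exceeding a predetermined difference threshold, that a data anomaly exists.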
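- The apparatus according to claim 15, wherein the historical distribution is a historical cluster distribution obtained by applying a clustering algorithm to the historical data; the substitution module is configured to: determine the position of each of the plurality of data packets in the distribution space of the historical cluster distribution, yielding a plurality of positions; and determine the distances between the plurality of positions and the corresponding cluster center positions of the historical cluster distribution, yielding a plurality of distances; the threshold comparison module is configured to compare the plurality of distances with a preset distance threshold and determine the number of data packets whose distance exceeds the preset distance threshold, that is, a second number; and the determining unit is configured to determine that a data anomaly exists in response to the second number exceeding a predetermined count threshold, or the ratio of the second number to the number of the plurality of data packets exceeding a predetermined ratio threshold.
- The apparatus according to claim 16, wherein the historical distribution is a historical cluster distribution obtained by applying a clustering algorithm to the historical data; the distribution determination module is configured to process the plurality of data packets with the clustering algorithm to obtain a current cluster distribution; and the distribution comparison module is configured to compare the current cluster distribution with the historical cluster distribution.
- The apparatus according to claim 24, wherein the distribution comparison module is configured to: determine a first center position, the first center position being the center position of the cluster corresponding to the current cluster distribution; acquire a second center position, the second center position being the center position of the cluster corresponding to the historical cluster distribution; and determine the distance between the first center position and the second center position; and wherein the determining unit is configured to determine, in response to the distance exceeding a predetermined distance threshold, that a data anomaly exists.
- The apparatus according to claim 24, wherein the distribution comparison module is configured to: determine the distance between the position of a randomly drawn data packet in the current cluster distribution and the center of the cluster corresponding to the current cluster distribution, that is, a first distance; determine the distance between the position of the same data packet in the historical cluster distribution and the center of the cluster corresponding to the historical cluster distribution, that is, a second distance; and determine the distance difference between the first distance and the second distance; and wherein the determining unit is configured to determine, in response to the distance difference exceeding a predetermined difference threshold, that a data anomaly exists.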
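- A computer-readable storage medium having stored thereon a computer program that, when executed in a computer, causes the computer to perform the method according to any one of claims 1 to 13.
- A computing device comprising a memory and a processor, characterized in that the memory stores executable code and the processor, when executing the executable code, implements the method according to any one of claims 1 to 13.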
Priority Applications (4)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| SG11202001225RA SG11202001225RA (en) | 2017-12-29 | 2018-11-19 | Method and device for determining data anomaly |
| EP18896044.7A EP3654611B1 (en) | 2017-12-29 | 2018-11-19 | Methods, computer-readable storage medium and device for determining data anomaly |
| US16/810,961 US10917424B2 (en) | 2017-12-29 | 2020-03-06 | Method and device for determining data anomaly |
| US16/911,078 US10917426B2 (en) | 2017-12-29 | 2020-06-24 | Method and device for determining data anomaly |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201711474464.6A CN110110160B (zh) | 2017-12-29 | 2017-12-29 | Method and device for determining data anomaly |
| CN201711474464.6 | 2017-12-29 |
Related Child Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US16/810,961 Continuation US10917424B2 (en) | 2017-12-29 | 2020-03-06 | Method and device for determining data anomaly |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2019128525A1 true WO2019128525A1 (zh) | 2019-07-04 |
Family
ID=67066389
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2018/116085 Ceased WO2019128525A1 (zh) | 2017-12-29 | 2018-11-19 | Method and device for determining data anomaly |
Country Status (6)
| Country | Link |
|---|---|
| US (2) | US10917424B2 (zh) |
| EP (1) | EP3654611B1 (zh) |
| CN (1) | CN110110160B (zh) |
| SG (1) | SG11202001225RA (zh) |
| TW (1) | TWI703454B (zh) |
| WO (1) | WO2019128525A1 (zh) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
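| CN114444924A (zh) * | 2022-01-25 | 2022-05-06 | State Grid Beijing Electric Power Company | Intelligent risk identification, prevention and control method, system, device, and storage medium |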
Families Citing this family (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
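| CN110110160B (zh) | 2017-12-29 | 2020-04-14 | Alibaba Group Holding Limited | Method and device for determining data anomaly |
| US10984122B2 (en) | 2018-04-13 | 2021-04-20 | Sophos Limited | Enterprise document classification |
| CN110781220A (zh) * | 2019-09-20 | 2020-02-11 | Jiangsu Xinhao Testing Technology Co., Ltd. | Fault early-warning method and apparatus, storage medium, and electronic device |
| TWI749416B (zh) * | 2019-11-29 | 2021-12-11 | China Steel Corporation | Abnormality monitoring and diagnosis method for variable-speed equipment |
| CN112329784A (zh) * | 2020-11-23 | 2021-02-05 | Guilin University of Electronic Technology | Correlation filter tracking method based on spatiotemporal awareness and multi-peak response |
| CN113048807B (zh) * | 2021-03-15 | 2022-07-26 | Taiyuan University of Technology | Back-pressure anomaly detection method for air-cooled units |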
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN103441982A (zh) * | 2013-06-24 | 2013-12-11 | 杭州师范大学 | Intrusion alarm analysis method based on relative entropy |
| US20160219067A1 (en) * | 2015-01-28 | 2016-07-28 | Korea Internet & Security Agency | Method of detecting anomalies suspected of attack, based on time series statistics |
| CN106101102A (zh) * | 2016-06-15 | 2016-11-09 | 华东师范大学 | Network abnormal traffic detection method based on the PAM clustering algorithm |
| CN107481117A (zh) * | 2017-08-21 | 2017-12-15 | 掌阅科技股份有限公司 | Abnormal behavior detection method, electronic device and computer storage medium |
| CN107491970A (zh) * | 2017-08-17 | 2017-12-19 | 北京三快在线科技有限公司 | Real-time anti-cheating detection and monitoring method and system, and computing device |
Family Cites Families (43)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US5705739A (en) * | 1996-08-27 | 1998-01-06 | Levine; Robert A. | Detecting specific medical conditions from erythrocyte density distribution in a centrifuged anticoagulated whole blood sample |
| US7181765B2 (en) | 2001-10-12 | 2007-02-20 | Motorola, Inc. | Method and apparatus for providing node security in a router of a packet network |
| US7213264B2 (en) | 2002-01-31 | 2007-05-01 | Mazu Networks, Inc. | Architecture to thwart denial of service attacks |
| US7254633B2 (en) * | 2002-02-07 | 2007-08-07 | University Of Massachusetts Amherst | Probabilistic packet marking |
| GB0410254D0 (en) * | 2004-05-07 | 2004-06-09 | British Telecomm | Processing of data in networks |
| US7653007B2 (en) * | 2004-06-04 | 2010-01-26 | Alcatel-Lucent Usa Inc. | Per-flow traffic estimation |
| US7519564B2 (en) * | 2004-11-16 | 2009-04-14 | Microsoft Corporation | Building and using predictive models of current and future surprises |
| IN2015MN00459A (zh) | 2005-06-29 | 2015-09-04 | Univ Boston | |
| US20070076611A1 (en) * | 2005-10-05 | 2007-04-05 | Fujitsu Limited | Detecting anomalies from acceptable traffic affected by anomalous traffic |
| US7712134B1 (en) | 2006-01-06 | 2010-05-04 | Narus, Inc. | Method and apparatus for worm detection and containment in the internet core |
| WO2007100916A2 (en) * | 2006-02-28 | 2007-09-07 | The Trustees Of Columbia University In The City Of New York | Systems, methods, and media for outputting a dataset based upon anomaly detection |
| US8248946B2 (en) | 2006-06-06 | 2012-08-21 | Polytechnic Institute of New York University | Providing a high-speed defense against distributed denial of service (DDoS) attacks |
| US8312541B2 (en) * | 2007-07-17 | 2012-11-13 | Cisco Technology, Inc. | Detecting neighbor discovery denial of service attacks against a router |
| EP2227889B1 (en) * | 2007-12-31 | 2011-07-13 | Telecom Italia S.p.A. | Method of detecting anomalies in a communication system using symbolic packet features |
| WO2009083022A1 (en) * | 2007-12-31 | 2009-07-09 | Telecom Italia S.P.A. | Method of detecting anomalies in a communication system using numerical packet features |
| EP2088742B1 (en) * | 2008-02-11 | 2013-04-10 | Universita' degli studi di Brescia | Method for determining if an encrypted flow of packets belongs to a predefined class of flows |
| US9258217B2 (en) | 2008-12-16 | 2016-02-09 | At&T Intellectual Property I, L.P. | Systems and methods for rule-based anomaly detection on IP network flow |
| US8618934B2 (en) * | 2009-04-27 | 2013-12-31 | Kolos International LLC | Autonomous sensing module, a system and a method of long-term condition monitoring of structures |
| TWI367452B (en) * | 2009-08-21 | 2012-07-01 | Shih Chin Lee | Method for detecting abnormal transactions of financial assets and information processing device performing the method |
| US8874763B2 (en) * | 2010-11-05 | 2014-10-28 | At&T Intellectual Property I, L.P. | Methods, devices and computer program products for actionable alerting of malevolent network addresses based on generalized traffic anomaly analysis of IP address aggregates |
| US9106689B2 (en) * | 2011-05-06 | 2015-08-11 | Lockheed Martin Corporation | Intrusion detection using MDL clustering |
| US9628499B1 (en) * | 2012-08-08 | 2017-04-18 | Google Inc. | Statistics-based anomaly detection |
| CN103111982B (zh) | 2013-01-25 | 2015-04-15 | 中国海洋石油总公司 | Component mounting device and disassembly/assembly device |
| US9288220B2 (en) * | 2013-11-07 | 2016-03-15 | Cyberpoint International Llc | Methods and systems for malware detection |
| US20150256431A1 (en) * | 2014-03-07 | 2015-09-10 | Cisco Technology, Inc. | Selective flow inspection based on endpoint behavior and random sampling |
| WO2015167421A1 (en) * | 2014-04-28 | 2015-11-05 | Hewlett-Packard Development Company, L.P. | Network flow classification |
| US9635050B2 (en) * | 2014-07-23 | 2017-04-25 | Cisco Technology, Inc. | Distributed supervised architecture for traffic segregation under attack |
| US9344441B2 (en) | 2014-09-14 | 2016-05-17 | Cisco Technology, Inc. | Detection of malicious network connections |
| US9722906B2 (en) * | 2015-01-23 | 2017-08-01 | Cisco Technology, Inc. | Information reporting for anomaly detection |
| US20170046700A1 (en) * | 2015-08-10 | 2017-02-16 | Ca, Inc. | Anomaly detection and user-context driven authorization request for automatic payments through mobile devices |
| US9953160B2 (en) * | 2015-10-13 | 2018-04-24 | Paypal, Inc. | Applying multi-level clustering at scale to unlabeled data for anomaly detection and security |
| CN105262647A (zh) * | 2015-11-27 | 2016-01-20 | 广州神马移动信息科技有限公司 | Abnormal index detection method and device |
| US10542026B2 (en) * | 2015-12-15 | 2020-01-21 | Flying Cloud Technologies, Inc. | Data surveillance system with contextual information |
| US9979740B2 (en) * | 2015-12-15 | 2018-05-22 | Flying Cloud Technologies, Inc. | Data surveillance system |
| US10469511B2 (en) | 2016-03-28 | 2019-11-05 | Cisco Technology, Inc. | User assistance coordination in anomaly detection |
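| CN105871879B (zh) * | 2016-05-06 | 2019-03-05 | 中国联合网络通信集团有限公司 | Method and device for automatic detection of abnormal network element behavior |
| CN106204335A (zh) | 2016-07-21 | 2016-12-07 | 广东工业大学 | Method, device and system for judging abnormal electricity price execution |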
| US11397792B2 (en) * | 2016-09-08 | 2022-07-26 | Nec Corporation | Anomaly detecting device, anomaly detecting method, and recording medium |
| US10375096B2 (en) | 2016-12-08 | 2019-08-06 | Cisco Technology, Inc. | Filtering onion routing traffic from malicious domain generation algorithm (DGA)-based traffic classification |
| US11190543B2 (en) | 2017-01-14 | 2021-11-30 | Hyprfire Pty Ltd | Method and system for detecting and mitigating a denial of service attack |
| CN107515889A (zh) * | 2017-07-03 | 2017-12-26 | 国家计算机网络与信息安全管理中心 | Method and device for real-time monitoring of microblog topics |
| US10686816B1 (en) | 2017-09-28 | 2020-06-16 | NortonLifeLock Inc. | Insider threat detection under user-resource bi-partite graphs |
| CN110110160B (zh) | 2017-12-29 | 2020-04-14 | 阿里巴巴集团控股有限公司 | Method and device for determining data anomaly |
2017
- 2017-12-29: CN, application CN201711474464.6A, patent CN110110160B (zh), status: Active
2018
- 2018-10-23: TW, application TW107137363A, patent TWI703454B (zh), status: active
- 2018-11-19: EP, application EP18896044.7A, patent EP3654611B1 (en), status: Active
- 2018-11-19: SG, application SG11202001225RA, patent SG11202001225RA (en), status: unknown
- 2018-11-19: WO, application PCT/CN2018/116085, publication WO2019128525A1 (zh), status: Ceased
2020
- 2020-03-06: US, application US16/810,961, patent US10917424B2 (en), status: Active
- 2020-06-24: US, application US16/911,078, patent US10917426B2 (en), status: Active
Non-Patent Citations (2)
| Title |
|---|
| QIAN, TENG: "Anomaly Detection for Time Series of Network Activity", MASTER THESIS, no. 2, 15 February 2016 (2016-02-15), pages 1 - 80, XP009519064 * |
| See also references of EP3654611A4 * |
Also Published As
| Publication number | Publication date |
|---|---|
| US20200329063A1 (en) | 2020-10-15 |
| EP3654611A4 (en) | 2020-08-19 |
| US20200213341A1 (en) | 2020-07-02 |
| EP3654611A1 (en) | 2020-05-20 |
| CN110110160B (zh) | 2020-04-14 |
| US10917426B2 (en) | 2021-02-09 |
| EP3654611B1 (en) | 2021-06-23 |
| US10917424B2 (en) | 2021-02-09 |
| TW201931167A (zh) | 2019-08-01 |
| SG11202001225RA (en) | 2020-03-30 |
| TWI703454B (zh) | 2020-09-01 |
| CN110110160A (zh) | 2019-08-09 |
Similar Documents
| Publication | Title | Publication Date |
|---|---|---|
| WO2019128525A1 (zh) | Method and device for determining data anomaly | |
| US12506754B2 (en) | System and methods for cybersecurity analysis using UEBA and network topology data and trigger-based network remediation | |
| US20200389495A1 (en) | Secure policy-controlled processing and auditing on regulated data sets | |
| US7900194B1 (en) | Kernel-based intrusion detection using bloom filters | |
| US20210092160A1 (en) | Data set creation with crowd-based reinforcement | |
| US20210034759A1 (en) | Systems and methods for attributing security vulnerabilities to a configuration of a client device | |
| US20190306178A1 (en) | Distributed System for Adaptive Protection Against Web-Service-Targeted Vulnerability Scanners | |
| US12003546B1 (en) | System and method for security control over data flows in distributed computing systems | |
| US12289332B2 (en) | Cybersecurity systems and methods for protecting, detecting, and remediating critical application security attacks | |
| US11831608B2 (en) | Application firewalls based on self-modeling service flows | |
| US20250231555A1 (en) | System and method for inferring device type based on port usage | |
| US20210092159A1 (en) | System for the prioritization and dynamic presentation of digital content | |
| US20250131091A1 (en) | Cloud Ransomware Detection | |
| US20240195841A1 (en) | System and method for manipulation of secure data | |
| US10812496B2 (en) | Automatic generation of cluster descriptions | |
| US12572846B2 (en) | System and method for device attribute identification based on host configuration protocols | |
| US20230053322A1 (en) | Script Classification on Computing Platform | |
| WO2019021104A1 (en) | RECOVERING APPLICATION FUNCTIONS VIA AN ANALYSIS OF FUNCTIONAL APPLICATIONS FOR APPLICATION | |
| CN119210841B (zh) | Service access method and apparatus based on dynamic trust evaluation, computer device, computer-readable storage medium and product | |
| US20220351210A1 (en) | Method and system for detection of abnormal transactional behavior | |
| US11838313B2 (en) | Artificial intelligence (AI)-based malware detection | |
| KR102311997B1 (ko) | EDR device and method based on artificial-intelligence behavior analysis | |
| US20240214410A1 (en) | Systems, media, and methods for utilizing a crosswalk algorithm to identify controls across frameworks, and for utilizing identified controls to generate cybersecurity risk assessments | |
| US20220398588A1 (en) | Identifying an unauthorized data processing transaction | |
| US12615280B2 (en) | Detecting polymorphic botnets using an image recognition platform |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 18896044; Country of ref document: EP; Kind code of ref document: A1 |
| | ENP | Entry into the national phase | Ref document number: 2018896044; Country of ref document: EP; Effective date: 20200213 |
| | NENP | Non-entry into the national phase | Ref country code: DE |