WO2019218927A1 - Abnormal user identification - Google Patents
Abnormal user identification
- Publication number
- WO2019218927A1 (application PCT/CN2019/086232)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- user
- data
- class
- initial
- feature vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/14—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
- H04L63/1408—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
- H04L63/1416—Event detection, e.g. attack signature detection
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2413—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
- G06F18/24133—Distances to prototypes
- G06F18/24137—Distances to cluster centroïds
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/2433—Single-class perspective, e.g. one-against-all classification; Novelty detection; Outlier detection
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/55—Detecting local intrusion or implementing counter-measures
- G06F21/552—Detecting local intrusion or implementing counter-measures involving long-term monitoring or reporting
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/10—Network architectures or network communication protocols for network security for controlling access to devices or network resources
- H04L63/101—Access control lists [ACL]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2221/00—Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F2221/03—Indexing scheme relating to G06F21/50, monitoring users, programs or devices to maintain the integrity of platforms
- G06F2221/034—Test or assess a computer or a system
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/10—Network architectures or network communication protocols for network security for controlling access to devices or network resources
- H04L63/102—Entity profiles
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/14—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
- H04L63/1408—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
- H04L63/1425—Traffic logging, e.g. anomaly detection
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W12/00—Security arrangements; Authentication; Protecting privacy or anonymity
- H04W12/60—Context-dependent security
- H04W12/68—Gesture-dependent or behaviour-dependent
Definitions
- In a network system, the hardware, software, and data must be well protected so that the network system operates continuously and reliably.
- Security devices are usually deployed at the edge routers that connect the internal and external networks. The security device filters the packets sent by the internal network or by the external network to ensure the security of the network system.
- The detection of abnormal users is complicated by the unpredictability of user behavior. For example, different kinds of operations performed by users in different time periods and at different locations must be detected.
- For example, a user frequently sends and receives emails, opens illegal web pages, downloads illegal videos, and the like.
- FIG. 1 is a schematic flowchart of an abnormal user identification method according to an embodiment of the present disclosure.
- FIG. 2 is another schematic flowchart of an abnormal user identification method according to an embodiment of the present disclosure.
- FIG. 3 is a schematic diagram of a feature system according to an embodiment of the present disclosure.
- FIG. 4 is a distribution diagram of a user class according to an embodiment of the present disclosure.
- FIG. 5 is a schematic diagram of a normal distribution curve according to an embodiment of the present disclosure.
- FIG. 6 is a schematic diagram of a cumulative probability curve according to an embodiment of the present disclosure.
- FIG. 7 is still another schematic flowchart of an abnormal user identification method according to an embodiment of the present disclosure.
- FIG. 8 is a schematic structural diagram of an abnormal user identification apparatus according to an embodiment of the present disclosure.
- FIG. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
- The identification of abnormal users on the intranet can be achieved by setting a blacklist. Specifically, the administrator adds restricted usernames to the blacklist.
- With the above blacklist method, only abnormal users known to the administrator can be identified.
- The blacklist method cannot identify abnormal users who are unknown to the administrator and whose abnormal behavior has not yet been discovered.
- Embodiments of the present disclosure provide an abnormal user identification method.
- the abnormal user identification method can be applied to electronic devices such as servers, computers, mobile phones, and security devices.
- the execution subject is an electronic device.
- FIG. 1 is a schematic flowchart of an abnormal user identification method according to an embodiment of the present disclosure.
- the abnormal user identification method provided by the embodiment of the present disclosure includes the following steps.
- Step 101: The electronic device acquires user behavior data of the user.
- The electronic device may acquire user behavior data of multiple users, or multiple pieces of user behavior data of one user. If the electronic device acquires multiple pieces of user behavior data of one user, they include at least one piece of historical user behavior data and one piece of current user behavior data.
- When an abnormal user needs to be detected, the electronic device acquires user behavior data of the user.
- the electronic device can obtain user behavior data from the user behavior log.
- the user behavior log is used to record various network behaviors of the user.
- the electronic device can also obtain the user behavior data of the user from the user behavior data input by the user.
- Embodiments of the present disclosure do not limit the form in which an electronic device acquires user behavior data.
- the electronic device can set different time granularities according to various needs for abnormal user identification.
- the electronic device acquires user behavior data of the user within a preset time granularity.
- Step 102: The electronic device extracts multiple feature values of the user behavior data under a preset plurality of behavior dimensions.
- The behavior dimensions may be divided into service layer feature dimensions and behavior layer feature dimensions.
- the electronic device can quickly extract feature values in multiple behavior dimensions.
- The service layer feature dimensions may include: instant messaging (IM), web browsing, community forums, traffic, file transfer, and mail.
- The behavior layer feature dimensions may include: sending information, receiving information, sending files, File Transfer Protocol (FTP) traffic, Hypertext Transfer Protocol over Secure Socket Layer (HTTPS) traffic, and receiving mail.
- The electronic device obtains a plurality of behavior dimensions by arbitrarily combining the contents of the two layers of feature dimensions.
- The behavior dimensions obtained by the electronic device include, but are not limited to, the number of IM messages sent, the number of IM messages received, the number of files sent over IM, the size of files sent over IM, and the like.
- the electronic device extracts a plurality of feature values.
- Step 103: The electronic device determines a feature vector corresponding to the user behavior data according to the plurality of feature values.
- For each piece of user behavior data, the electronic device combines the plurality of feature values extracted from it to obtain the feature vector corresponding to that user behavior data.
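As a minimal sketch of Steps 102 and 103, the snippet below counts events per behavior dimension and combines the counts into a feature vector. The dimension names, the `(service, behavior)` event tuples, and the function `feature_vector` are illustrative assumptions, not part of the disclosure; size-type dimensions such as file size would need a sum instead of a count.

```python
from collections import Counter

# Hypothetical behavior dimensions: each combines a service layer entry
# with a behavior layer entry (names are illustrative only).
DIMENSIONS = [
    ("IM", "send_info"),
    ("IM", "recv_info"),
    ("IM", "send_file"),
    ("mail", "recv_mail"),
]

def feature_vector(events, dimensions=DIMENSIONS):
    """Count events per dimension and return the counts in a fixed order."""
    counts = Counter(events)
    # A dimension with no matching event contributes 0.
    return [counts[d] for d in dimensions]

events = [("IM", "send_info")] * 3 + [("IM", "recv_info")] * 2 + [("mail", "recv_mail")]
vector = feature_vector(events)
print(vector)
```

Keeping the dimension order fixed is what makes vectors from different users comparable component by component.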
- Step 104: The electronic device clusters the feature vectors by using a preset clustering algorithm to obtain multiple aggregation classes, and obtains a center vector of each aggregation class.
- the preset clustering algorithm may be a K-means clustering algorithm, a K-means Plus clustering algorithm, or the like.
- the electronic device clusters the feature vectors through a preset clustering algorithm to obtain multiple aggregation classes. At least one feature vector is included in each aggregation class.
- the electronic device calculates the mean of the plurality of feature vectors included in the one aggregation class, and uses the mean as the center vector of the one aggregation class.
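A rough sketch of Step 104, assuming plain K-means with Euclidean distance (the source does not fix the distance metric) and initial centers supplied by the caller so that the result is deterministic. The function names `mean_vector` and `kmeans` are hypothetical.

```python
import math

def mean_vector(vectors):
    """Component-wise mean, used as the center vector of a class."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def kmeans(vectors, centers, iterations=10):
    """Plain K-means: assign each vector to its nearest center, then recompute centers."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in vectors:
            nearest = min(range(len(centers)), key=lambda i: math.dist(v, centers[i]))
            clusters[nearest].append(v)
        # Keep the previous center if a cluster ends up empty.
        centers = [mean_vector(c) if c else centers[i] for i, c in enumerate(clusters)]
    return clusters, centers

vectors = [[1, 1], [1, 2], [9, 9], [10, 9]]
clusters, centers = kmeans(vectors, centers=[[0, 0], [10, 10]])
print(centers)
```

Each returned center is the mean of the feature vectors in its class, which matches the center vector definition above.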
- Step 105: The electronic device determines difference feature vectors. The distance value between a difference feature vector and the center vector of the aggregation class to which it belongs is not within a preset distance value range.
- The preset distance value range is stored in the electronic device in advance.
- A distance value is not within the preset distance value range when the distance value between a feature vector in the aggregation class and the center vector of the aggregation class is less than the minimum value of the range, or greater than the maximum value of the range.
- In either case, the electronic device determines that the feature vector is a difference feature vector.
- Take one aggregation class as an example.
- The electronic device calculates the distance value between each feature vector in the aggregation class and the center vector of the aggregation class. After obtaining the distance values, the electronic device sorts them, acquires the distance values that are not within the preset distance value range, and uses the feature vectors corresponding to those distance values as difference feature vectors.
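The check in Step 105 can be sketched as follows, assuming Euclidean distance and a hypothetical preset range `[min_dist, max_dist]`; neither the metric nor concrete range values are given in the source.

```python
import math

def difference_vectors(cluster, min_dist, max_dist):
    """Return the vectors whose distance to the class center is outside [min_dist, max_dist]."""
    n = len(cluster)
    center = [sum(col) / n for col in zip(*cluster)]  # center vector = mean
    flagged = []
    for v in cluster:
        d = math.dist(v, center)
        # Outside the range means below the minimum or above the maximum.
        if d < min_dist or d > max_dist:
            flagged.append(v)
    return flagged

cluster = [[1, 1], [1, 2], [2, 1], [4, 5]]
flagged = difference_vectors(cluster, min_dist=0.0, max_dist=2.0)
print(flagged)
```

Sorting the distance values, as the text describes, is not needed for correctness here; it only helps when the out-of-range values are read off the two ends of the sorted list.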
- Step 106: The electronic device determines the user represented by the difference feature vector as an abnormal user.
- For example, the electronic device extracts a plurality of feature values of user behavior data P1 of user Q1 under the preset plurality of behavior dimensions and, according to these feature values, determines feature vector 111 corresponding to user behavior data P1. If the electronic device determines that feature vector 111 is a difference feature vector, user Q1 is determined to be an abnormal user.
- By clustering the feature vectors, the electronic device acquires the difference feature vectors whose distance from the center vector of their aggregation class is not within the preset distance value range.
- The electronic device identifies abnormal users according to the acquired difference feature vectors.
- The administrator does not need to add restricted usernames to a blacklist.
- The electronic device does not need a blacklist to identify abnormal users. It can therefore identify abnormal users who are unknown to the administrator and whose abnormal behavior has not yet been discovered.
- For the case where the user behavior data acquired by the electronic device is the user behavior data of multiple users, an embodiment of the present disclosure provides the following abnormal user identification method.
- FIG. 2 is another schematic flowchart of an abnormal user identification method according to an embodiment of the present disclosure, where the method includes the following steps.
- Step 201: The electronic device acquires user behavior data of multiple users.
- When an abnormal user needs to be detected, the electronic device acquires user behavior data of a plurality of users.
- the electronic device can obtain user behavior data of multiple users from the user behavior log.
- the user behavior log is used to record various network behaviors of the user.
- the electronic device may also obtain user behavior data of a plurality of users from user behavior data input by the user.
- Embodiments of the present disclosure do not limit the form in which an electronic device acquires user behavior data.
- The electronic device may acquire user behavior data of different users according to a preset time granularity, and can set different time granularities according to the requirements of abnormal user identification.
- the electronic device can set a larger time granularity in advance.
- the preset time granularity of the electronic device may be: one week, one month, and the like.
- When identifying a user who shows sudden attack behavior before leaving the service, the electronic device can set a small time granularity in advance.
- the preset time granularity of the electronic device may be: 10 minutes, 1 hour, 24 hours, and the like.
- the electronic device acquires user behavior data for a plurality of users.
- For example, the time granularity preset by the electronic device is 10 minutes, and the users to be identified are A, B, and C.
- The electronic device can acquire user behavior data 11 of user A, user behavior data 12 of user B, and user behavior data 13 of user C.
- The electronic device may also acquire user behavior data 21 of user A, user behavior data 22 of user B, and user behavior data 23 of user C during the time period 9:50-10:00.
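One way to picture acquisition by time granularity is to group log records into fixed windows per user. The sketch below assumes a hypothetical log format of `(user, timestamp, event)` tuples and a hypothetical helper `bucket_by_granularity`; the date and event names are made up for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def bucket_by_granularity(records, granularity):
    """Group (user, timestamp, event) records into windows of length `granularity`."""
    epoch = datetime(1970, 1, 1)
    buckets = defaultdict(list)
    for user, ts, event in records:
        # Index of the window containing this timestamp, then its start time.
        window = (ts - epoch) // granularity
        window_start = epoch + window * granularity
        buckets[(user, window_start)].append(event)
    return dict(buckets)

records = [
    ("A", datetime(2019, 5, 9, 9, 41), "im_send"),
    ("A", datetime(2019, 5, 9, 9, 55), "im_send"),
    ("B", datetime(2019, 5, 9, 9, 58), "mail_recv"),
]
buckets = bucket_by_granularity(records, timedelta(minutes=10))
for key, events in sorted(buckets.items()):
    print(key, events)
```

With a 10-minute granularity, the 9:55 and 9:58 records fall into the 9:50-10:00 window, one bucket per user.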
- Step 202: The electronic device extracts multiple user feature values of each user's user behavior data under a preset plurality of user behavior dimensions.
- The user behavior dimensions may be divided into service layer feature dimensions and behavior layer feature dimensions.
- the electronic device can quickly extract user feature values under multiple user behavior dimensions.
- the service layer feature dimensions may include: IM, web browsing, community forum, traffic, file transfer, and mail.
- Behavior layer feature dimensions can include: sending information, receiving information, sending files, FTP traffic, HTTPS traffic, and receiving mail.
- the electronic device obtains a plurality of user behavior dimensions by arbitrarily combining the content included in the foregoing two-layer feature dimension.
- The user behavior dimensions obtained by the electronic device include, but are not limited to, the number of IM messages sent, the number of IM messages received, the number of files sent over IM, the size of files sent over IM, and the like.
- the electronic device extracts a plurality of user feature values for each of the plurality of users.
- Step 203: The electronic device determines the user feature vector of each of the plurality of users according to that user's plurality of user feature values.
- Take one of the plurality of users as an example.
- The electronic device combines the user feature values of that user to obtain the user feature vector of that user.
- For example, the electronic device acquires user behavior data 11 of user A, user behavior data 12 of user B, and user behavior data 13 of user C.
- From user behavior data 11, the electronic device extracts: number of IM messages sent, 10; number of IM messages received, 8; number of files sent over IM, 2; size of files sent over IM, 500 KB.
- From user behavior data 12, the electronic device extracts: number of IM messages sent, 9; number of IM messages received, 8; number of files sent over IM, 3; size of files sent over IM, 490 KB.
- From user behavior data 13, the electronic device extracts: number of IM messages sent, 10; number of IM messages received, 7; number of files sent over IM, 1; size of files sent over IM, 600 KB.
- The electronic device may then determine the user feature vector of each user: user feature vector 01 of user A is {10, 8, 2, 500}, user feature vector 02 of user B is {9, 8, 3, 490}, and user feature vector 03 of user C is {10, 7, 1, 600}.
- Step 204: The electronic device clusters the user feature vectors of the multiple users by using a preset clustering algorithm to obtain multiple user classes.
- the preset clustering algorithm may be a K-means clustering algorithm, a K-means Plus clustering algorithm, or the like.
- the electronic device performs clustering processing on user feature vectors of multiple users through a preset clustering algorithm to obtain multiple user classes. At least one user feature vector is included in each user class.
- the preset clustering algorithm is a K-means clustering algorithm.
- the K-means clustering algorithm is used to cluster the user feature vectors of multiple users to obtain K initial user classes. Where K is a positive integer.
- the electronic device treats the K initial user classes as K user classes.
- Step 205: The electronic device determines the center vector of each of the plurality of user classes according to the user feature vectors included in that user class.
- Take one user class as an example.
- The electronic device calculates the mean of the user feature vectors included in the user class and uses the mean as the center vector of the user class.
- For example, the plurality of user classes include user class 1, which includes user feature vector 01 of user A, user feature vector 02 of user B, and user feature vector 03 of user C.
- The electronic device calculates the mean t1 of user feature vector 01, user feature vector 02, and user feature vector 03, and determines the mean t1 as the center vector of user class 1.
- Step 206: The electronic device acquires the difference feature vectors of each user class in the plurality of user classes.
- A difference feature vector is a user feature vector in a user class whose distance value from the center vector of that user class is not within a preset distance value range.
- The preset distance value range is stored in the electronic device in advance.
- A distance value is not within the preset distance value range when the distance value between a user feature vector in the user class and the center vector of the user class is less than the minimum value of the range, or greater than the maximum value of the range.
- In either case, the electronic device determines that the user feature vector is a difference feature vector.
- Take one user class as an example.
- The electronic device calculates the distance value between each user feature vector in the user class and the center vector of the user class. After obtaining the distance values, the electronic device sorts them, acquires the distance values that are not within the preset distance value range, and uses the user feature vectors corresponding to those distance values as difference feature vectors.
- For example, user class 1 includes user feature vector 01 of user A, user feature vector 02 of user B, and user feature vector 03 of user C, and the center vector of user class 1 is t1.
- The distance value between user feature vector 01 and center vector t1 is d01, the distance value between user feature vector 02 and center vector t1 is d02, and the distance value between user feature vector 03 and center vector t1 is d03.
- If d01 is not within the preset distance value range, the electronic device determines user feature vector 01, represented by d01, as a difference feature vector.
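For illustration, the center vector t1 and the three distance values can be computed from the example vectors {10, 8, 2, 500}, {9, 8, 3, 490}, and {10, 7, 1, 600}. Euclidean distance is an assumption here; the source does not name the metric, and it does not state the preset range, so the sketch stops at the distances.

```python
import math

# User feature vectors 01, 02, 03 from the example above.
vectors = {
    "01": [10, 8, 2, 500],
    "02": [9, 8, 3, 490],
    "03": [10, 7, 1, 600],
}

n = len(vectors)
# Center vector t1: component-wise mean of the three vectors.
t1 = [sum(col) / n for col in zip(*vectors.values())]

# Distance of each user feature vector from t1 (Euclidean, by assumption).
distances = {name: math.dist(v, t1) for name, v in vectors.items()}

print("t1 =", [round(x, 2) for x in t1])
for name, d in distances.items():
    print(f"d{name} = {d:.2f}")
```

Under this assumption d01 is roughly 30.0, d02 roughly 40.0, and d03 roughly 70.0. Which vector counts as a difference feature vector then depends entirely on the preset range: since vectors below the minimum are also flagged, user feature vector 01 would be flagged by a range whose minimum lies above d01.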
- the distribution of user feature vectors is different in different user classes.
- the preset distance value range of each user class may be separately stored in the electronic device.
- Step 207: The electronic device determines the user represented by the difference feature vector as an abnormal user.
- For example, if the electronic device determines user feature vector 01 as a difference feature vector, the user represented by user feature vector 01, that is, user A, is determined to be an abnormal user.
- By clustering the user feature vectors, the electronic device obtains the difference feature vectors in each user class.
- The electronic device identifies abnormal users according to the difference feature vectors.
- The administrator does not need to add restricted usernames to a blacklist, and the electronic device does not need a blacklist to identify abnormal users.
- The abnormal user identification method provided by the embodiment of the present disclosure can identify abnormal users who are unknown to the administrator and whose abnormal behavior has not yet been discovered.
- the electronic device stores a preset number threshold for limiting the number of user feature vectors included in the user class.
- the electronic device performs clustering processing on the user feature vectors of the plurality of users by using a preset clustering algorithm to obtain a plurality of user classes (step 204), which may include the following steps.
- the K-means clustering algorithm is used to cluster the user feature vectors of multiple users to obtain K initial user classes.
- The electronic device detects whether any of the K initial user classes includes fewer user feature vectors than the number threshold. If not, the electronic device uses the K initial user classes as the K user classes.
- Otherwise, the electronic device acquires a first initial user class and a second initial user class among the K initial user classes.
- The first initial user class is an initial user class, among the K initial user classes, whose number of user feature vectors is less than the preset number threshold.
- The second initial user class is the initial user class, among the K initial user classes, whose center vector has the smallest distance value from the center vector of the first initial user class.
- The electronic device merges the first initial user class and the second initial user class to obtain a merged initial user class.
- The electronic device uses the merged initial user class as a user class after clustering, and uses the other, unmerged initial user classes among the K initial user classes as user classes after clustering. In this way, the electronic device obtains a plurality of user classes.
- For example, the preset number threshold is 10.
- The electronic device clusters the user feature vectors of multiple users through the K-means clustering algorithm to obtain five initial user classes: initial user class 1, initial user class 2, initial user class 3, initial user class 4, and initial user class 5.
- Initial user class 1 includes 8 user feature vectors, initial user class 2 includes 12, initial user class 3 includes 11, initial user class 4 includes 15, and initial user class 5 includes 17.
- Because initial user class 1 includes fewer than 10 user feature vectors, it is the first initial user class.
- The electronic device calculates the distance value d11 between the center vector of initial user class 2 and the center vector of initial user class 1.
- The electronic device calculates the distance value d12 between the center vector of initial user class 3 and the center vector of initial user class 1.
- The electronic device calculates the distance value d13 between the center vector of initial user class 4 and the center vector of initial user class 1.
- The electronic device calculates the distance value d14 between the center vector of initial user class 5 and the center vector of initial user class 1.
- If d11 is the smallest of these distance values, the electronic device may determine that initial user class 2 is the second initial user class.
- The electronic device merges initial user class 1 with initial user class 2 to obtain merged initial user class 1.
- The electronic device uses merged initial user class 1 as user class 01 after clustering, unmerged initial user class 3 as user class 03, initial user class 4 as user class 04, and initial user class 5 as user class 05. In this way, the electronic device obtains 4 user classes.
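The merging of undersized initial user classes can be sketched as below. It assumes Euclidean distance between center vectors and repeats the merge until no class is below the threshold; the repetition and the function name `merge_small_classes` are assumptions, since the text describes a single merge.

```python
import math

def mean_vector(vectors):
    """Component-wise mean, used as the center vector of a class."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def merge_small_classes(classes, threshold):
    """Merge each class with fewer than `threshold` vectors into the class with the nearest center."""
    classes = [list(c) for c in classes]
    while len(classes) > 1:
        # First initial class: one whose size is below the threshold.
        small = next((i for i, c in enumerate(classes) if len(c) < threshold), None)
        if small is None:
            break
        centers = [mean_vector(c) for c in classes]
        # Second initial class: the one whose center is nearest to the small class.
        nearest = min(
            (i for i in range(len(classes)) if i != small),
            key=lambda i: math.dist(centers[i], centers[small]),
        )
        classes[nearest].extend(classes[small])
        del classes[small]
    return classes

initial = [
    [[0.0], [1.0]],             # below the threshold of 3
    [[2.0], [3.0], [4.0]],
    [[10.0], [11.0], [12.0]],
]
merged = merge_small_classes(initial, threshold=3)
print([len(c) for c in merged])
```

In this toy input the two-vector class is absorbed by the class centered at 3.0, leaving two user classes.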
- The electronic device may merge the obtained user classes by calculating aggregate values of the user feature vectors.
- The aggregate value characterizes how reasonably a user feature vector belongs to its user class.
- the electronic device can take the following steps to obtain an aggregate value.
- the electronic device calculates a first distance value between the first user feature vector and each of the second user feature vectors, respectively.
- the second user feature vector is: a user feature vector other than the first user feature vector included in the user class where the first user feature vector is located.
- the electronic device performs average processing on the plurality of first distance values to obtain a first distance average value.
- the electronic device calculates a second distance value between the first user feature vector and each of the third user feature vectors, respectively.
- the third user feature vector is: a user feature vector included in each user class except the user class in which the first user feature vector is located.
- the electronic device performs average processing on a plurality of second distance values belonging to the same user class to obtain a plurality of second distance average values.
- the electronic device acquires a distance mean minimum value among the plurality of second distance averages.
- the electronic device calculates a ratio of the first distance mean value to the distance mean minimum value, and uses the ratio of the first distance mean value and the distance mean minimum value as the aggregate value of the first user feature vector.
- a distribution map of user classes as shown in FIG. Each black dot in Figure 4 represents a user feature vector.
- the user class 11, the user class 12, and the user class 13 are included in FIG. 11 to user class of the user feature vector L 11 comprises, for example, in calculating the aggregate value, the electronic device calculates a first distance between the user and the user characteristic vector category 11 L 11 L 12 included value d 21, L 11 and calculates The first distance value d 22 between the user feature vectors L 13 included by the user class 11 calculates a first distance value d 23 between the L 11 and the user feature vector L 14 included by the user class 11.
- the electronic device calculates the mean of d 21 , d 22 and d 23 to obtain a first distance mean D 1 .
- the electronic device calculates a second distance value d 24 between the L 11 and the user feature vector L 21 included by the user class 12, and calculates a second distance value d 25 between the L 11 and the user feature vector L 22 included by the user class 12, A second distance value d 26 between L 11 and the user feature vector L 23 included by the user class 12 is calculated.
- the electronic device calculates the mean of d 24 , d 25 and d 26 to obtain a second distance mean D 2 .
- the electronic device calculates a second distance value d 27 between the L 11 and the user feature vector L 31 included by the user class 13 , and calculates a second distance value d 28 between the L 11 and the user feature vector L 32 included by the user class 13 .
- a second distance value d 29 between the L 11 and the user feature vector L 33 included by the user class 13 is calculated.
- the electronic device calculates the mean of d 27 , d 28 and d 29 to obtain a second distance mean D 3 .
- If D 2 < D 3, the electronic device calculates the ratio of D 1 to D 2, i.e., D 1 /D 2, and uses D 1 /D 2 as the aggregated value J 11 of the user feature vector L 11.
- the electronic device can calculate an aggregated value of other user feature vectors included in the user class 11 and an aggregated value of the user feature vector included in the user class 12 and the user class 13, and will not be repeated herein.
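The aggregated value calculation described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `aggregated_value`, the Euclidean distance, and the 2-D coordinates are assumptions chosen for the example, and each class is assumed to contain at least two vectors.

```python
import math

def aggregated_value(vector, own_class, other_classes):
    """Ratio of the first distance mean to the smallest second distance mean."""
    # First distance mean: average distance to the other vectors of the same class.
    peers = [v for v in own_class if v is not vector]
    first_mean = sum(math.dist(vector, v) for v in peers) / len(peers)
    # Second distance means: one average distance per other class.
    second_means = [
        sum(math.dist(vector, v) for v in cls) / len(cls)
        for cls in other_classes
    ]
    return first_mean / min(second_means)

# Hypothetical 2-D coordinates standing in for the vectors of FIG. 4.
class_11 = [(0.0, 0.0), (0.0, 2.0), (2.0, 0.0), (2.0, 2.0)]
class_12 = [(10.0, 0.0), (10.0, 2.0), (12.0, 0.0)]
class_13 = [(0.0, 20.0), (2.0, 20.0), (0.0, 22.0)]
j_11 = aggregated_value(class_11[0], class_11, [class_12, class_13])
```

With this definition, a small value means the vector sits close to its own class and far from the nearest other class.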
- the process of the electronic device performing the combining process on the obtained multiple user classes may include the following steps.
- the electronic device calculates a distance value between the center vectors of any two of the plurality of user classes to obtain a plurality of distance values.
- the electronic device acquires the minimum distance value and determines the first user class and the second user class characterized by the minimum distance value.
- the electronic device acquires a first aggregated value of a user feature vector included in each of the plurality of user classes.
- the electronic device can obtain a plurality of first aggregated values.
- the electronic device uses the first user class and the second user class as the merged user class, that is, when the electronic device uses the first user class and the second user class as one user class, the electronic device acquires a second aggregated value of each user feature vector included in the user classes after merging.
- the electronic device can obtain a plurality of second aggregate values.
- the electronic device accumulates the plurality of first aggregated values to obtain a first sum value.
- the electronic device accumulates the plurality of second aggregated values to obtain a second sum value.
- the sum of the aggregated values of all the user feature vectors included in the plurality of user classes is used to evaluate the quality of the clustering effect.
- If the second sum value is less than the first sum value, the electronic device determines that the clustering effect after combining the first user class and the second user class is better, and combines the first user class and the second user class.
- the electronic device then recalculates the distance value between the center vectors of any two user classes in the plurality of user classes, determines the two user classes characterized by the minimum distance value among the obtained distance values, and performs the merging process on these two user classes, until the second sum value is not less than the first sum value.
- the electronic device calculates: a distance value z 1 between the center vector of the user class 11 and the center vector of the user class 12, a distance value z 2 between the center vector of the user class 11 and the center vector of the user class 13, and a distance value z 3 between the center vector of the user class 12 and the center vector of the user class 13. If z 1 < z 2 < z 3, z 1 is the smallest, so the user class 11 characterized by z 1 is determined as the first user class, and the user class 12 characterized by z 1 is taken as the second user class.
- For the user class 11, the electronic device calculates: the aggregated value J 11 of the user feature vector L 11, the aggregated value J 12 of the user feature vector L 12, the aggregated value J 13 of the user feature vector L 13, and the aggregated value J 14 of the user feature vector L 14.
- For the user class 12, the electronic device calculates: the aggregated value J 21 of the user feature vector L 21, the aggregated value J 22 of the user feature vector L 22, and the aggregated value J 23 of the user feature vector L 23. For the user class 13, the electronic device calculates: the aggregated value J 31 of the user feature vector L 31, the aggregated value J 32 of the user feature vector L 32, and the aggregated value J 33 of the user feature vector L 33.
- the electronic device uses the user class 11 and the user class 12 as the merged user class 01.
- For the merged user class 01, the electronic device calculates: the aggregated value J 01 of the user feature vector L 11, the aggregated value J 02 of the user feature vector L 12, the aggregated value J 03 of the user feature vector L 13, the aggregated value J 04 of the user feature vector L 14, and the aggregated values J 05, J 06 and J 07 of the user feature vectors L 21, L 22 and L 23.
- For the user class 13, the electronic device calculates: the aggregated value J 08 of the user feature vector L 31, the aggregated value J 09 of the user feature vector L 32, and the aggregated value J 10 of the user feature vector L 33.
- If the sum of J 01 to J 10 (the second sum value) is less than the sum of J 11 to J 14, J 21 to J 23 and J 31 to J 33 (the first sum value), the electronic device performs a merge process on the user class 11 and the user class 12 to obtain a merged user class 01. Otherwise, the electronic device does not merge the user class 11 and the user class 12.
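The merge decision walked through above can be sketched in a few lines. This is an illustrative sketch under stated assumptions: Euclidean distance, the ratio form of the aggregated value (own-class mean distance divided by the smallest other-class mean distance), at least two vectors per class, and a guard that skips merging when fewer than three classes remain. The helper names `center`, `aggregated_sum` and `try_merge_closest` and the sample coordinates are hypothetical.

```python
import math
from itertools import combinations

def center(cls):
    """Component-wise mean of the vectors in a class."""
    return tuple(sum(col) / len(cls) for col in zip(*cls))

def aggregated_sum(classes):
    """Sum of (own-class mean distance / smallest other-class mean distance)."""
    total = 0.0
    for i, cls in enumerate(classes):
        for k, vec in enumerate(cls):
            peers = cls[:k] + cls[k + 1:]
            first_mean = sum(math.dist(vec, p) for p in peers) / len(peers)
            second_means = [
                sum(math.dist(vec, o) for o in other) / len(other)
                for j, other in enumerate(classes) if j != i
            ]
            total += first_mean / min(second_means)
    return total

def try_merge_closest(classes):
    """Merge the two classes with the nearest centers if that lowers the sum."""
    if len(classes) < 3:
        # Assumption: with two classes left, the ratio is undefined after merging.
        return classes, False
    a, b = min(
        combinations(range(len(classes)), 2),
        key=lambda p: math.dist(center(classes[p[0]]), center(classes[p[1]])),
    )
    merged = [c for i, c in enumerate(classes) if i not in (a, b)]
    merged.append(classes[a] + classes[b])
    first_sum = aggregated_sum(classes)    # before merging
    second_sum = aggregated_sum(merged)    # after merging
    return (merged, True) if second_sum < first_sum else (classes, False)

# Hypothetical data: two nearby classes and one distant class.
classes = [
    [(0.0, 0.0), (0.0, 1.0)],
    [(1.0, 0.0), (1.0, 1.0)],
    [(50.0, 50.0), (50.0, 51.0)],
]
result, did_merge = try_merge_closest(classes)
```

Calling `try_merge_closest` repeatedly until it reports no merge corresponds to the loop that stops once the second sum value is not less than the first sum value.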
- the electronic device may also obtain the aggregated value by the following steps.
- the user feature vector L 11 included in the user class 11 in FIG. 4 is still taken as an example for description.
- the electronic device calculates D 1 , D 2 and D 3 , where D 2 < D 3 .
- the electronic device calculates the ratio of D 2 to D 1 , i.e., D 2 /D 1 . Thereafter, the electronic device uses (D 2 /D 1 - 1) as the aggregated value J 11 of the user feature vector L 11.
- the electronic device may also obtain the aggregated value by the following steps.
- the user feature vector L 11 included in the user class 11 in FIG. 4 is still taken as an example for description.
- the electronic device calculates D 1 , D 2 and D 3 , where D 2 < D 3 .
- the electronic device calculates the ratio of D 1 to D 2 , i.e., D 1 /D 2 . Thereafter, the electronic device uses (1 - D 1 /D 2 ) as the aggregated value J 11 of the user feature vector L 11.
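The two alternative definitions of the aggregated value can be compared side by side. The sketch below uses hypothetical distance means (D 1 = 2.0, D 2 = 8.0, D 3 = 10.0); the function names are illustrative and not taken from the source.

```python
def ratio_minus_one(d_own, d_nearest_other):
    """First alternative: D2 / D1 - 1."""
    return d_nearest_other / d_own - 1.0

def one_minus_ratio(d_own, d_nearest_other):
    """Second alternative: 1 - D1 / D2."""
    return 1.0 - d_own / d_nearest_other

# Hypothetical distance means for L11: D1 = 2.0, D2 = 8.0, D3 = 10.0.
d1, d2, d3 = 2.0, 8.0, 10.0
nearest = min(d2, d3)                              # distance mean minimum value
j_ratio_minus_one = ratio_minus_one(d1, nearest)   # 8 / 2 - 1
j_one_minus_ratio = one_minus_ratio(d1, nearest)   # 1 - 2 / 8
```

Both variants grow as a vector moves closer to its own class and further from the nearest other class, the opposite direction to the plain ratio D 1 /D 2.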
- based on the aggregated value obtained by subtracting 1 from the ratio, or the aggregated value obtained by subtracting the ratio from 1, the process of combining the obtained multiple user classes may include the following steps.
- the electronic device calculates a distance value between the center vectors of any two of the plurality of user classes to obtain a plurality of distance values.
- from the obtained plurality of distance values, the electronic device obtains the minimum distance value and determines the first user class and the second user class characterized by the minimum distance value.
- the electronic device acquires a first aggregated value of a user feature vector included in each of the plurality of user classes.
- the electronic device can obtain a plurality of first aggregated values.
- the electronic device uses the first user class and the second user class as the merged user class, that is, when the electronic device uses the first user class and the second user class as one user class, the electronic device acquires a second aggregated value of each user feature vector included in the user classes after merging.
- the electronic device can obtain a plurality of second aggregate values.
- the electronic device accumulates the plurality of first aggregated values to obtain a first sum value.
- the electronic device accumulates the plurality of second aggregated values to obtain a second sum value.
- If the second sum value is greater than the first sum value, the electronic device determines that the clustering effect after combining the first user class and the second user class is better, and combines the first user class and the second user class.
- the electronic device then recalculates the distance value between the center vectors of any two user classes in the plurality of user classes, determines the two user classes characterized by the minimum distance value among the obtained distance values, and performs the merging process on these two user classes, until the second sum value is not greater than the first sum value.
- the electronic device may perform rough classification on multiple users according to user attributes of each user of the multiple users, and obtain multiple rough classifications. Each rough classification is processed in the same way; one rough classification is taken as an example below.
- the electronic device performs clustering processing on the plurality of user feature vectors included in the one rough classification by using a preset clustering algorithm to obtain a plurality of user classes.
- user attributes include job attributes.
- Job attributes include: accounting, cashier, human resources, customer service, R&D design, and more.
- the user is roughly classified according to the user's job attributes. For example, users who belong to the finance department, such as accountants and cashiers, are divided into one rough classification; users who belong to the personnel department, such as human resources staff, are divided into one rough classification; users who belong to the administrative department, such as customer service staff, are divided into one rough classification; users who belong to the design department, such as R&D design staff, are divided into one rough classification; and so on.
- When performing the clustering process, the electronic device applies the preset clustering algorithm separately to the user feature vectors of the users included in each of the four rough classifications (design department, finance department, administration department, and personnel department) to obtain multiple user classes.
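The rough classification step amounts to grouping users by department before clustering each group on its own. A minimal sketch, assuming a hypothetical job-to-department mapping and hypothetical user records of the form (user id, job attribute, user feature vector):

```python
from collections import defaultdict

# Illustrative mapping from job attribute to department (rough classification).
JOB_TO_DEPARTMENT = {
    "accounting": "finance",
    "cashier": "finance",
    "human resources": "personnel",
    "customer service": "administration",
    "R&D design": "design",
}

def rough_classify(users):
    """Group (user_id, job, feature_vector) records by department."""
    groups = defaultdict(list)
    for user_id, job, vector in users:
        groups[JOB_TO_DEPARTMENT[job]].append((user_id, vector))
    return dict(groups)

# Hypothetical users and feature vectors.
users = [
    ("u1", "cashier", (3.0, 1.0)),
    ("u2", "accounting", (2.5, 1.2)),
    ("u3", "R&D design", (9.0, 4.0)),
    ("u4", "customer service", (5.0, 0.5)),
]
groups = rough_classify(users)
# Each group would then be clustered separately with the preset algorithm.
```

Clustering within a department keeps users with very different job duties from being compared against one another.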
- the electronic device pre-stores a range of distance values for each user class.
- the range of distance values is used to limit the distance value between the user feature vector included in the user class and the center vector of the user class.
- the electronic device can determine the range of distance values using the following steps.
- the electronic device calculates a distance value between the center vector of the user class X and each user feature vector included in the user class X, to obtain a plurality of distance values.
- User class X is any user class.
- the electronic device calculates a distance average of a plurality of distance values as a third distance average.
- the electronic device also calculates the standard deviation of the plurality of distance values as the first standard deviation.
- the electronic device constructs a normal distribution curve according to the third distance mean and the first standard deviation. The normal distribution curve is used to characterize the distance value distribution between the center vector of the user class X and the user feature vector included in the user class X.
- the electronic device determines the first boundary value and the second boundary value according to the third distance mean and the first standard deviation.
- the first boundary value is smaller than the third distance mean value, and the absolute value of the difference between the first boundary value and the third distance mean value is: a preset multiple of the first standard deviation.
- the second boundary value is greater than the third distance mean value, and the absolute value of the difference between the second boundary value and the third distance mean value is also: a preset multiple of the first standard deviation.
- the electronic device determines the interval composed of the first boundary value and the second boundary value as the range of distance values of the user class X.
- the preset multiple is 3.
- the electronic device determines the range of the distance value of the user class X based on 3 standard deviations, as shown in FIG. 5.
- μ 1 is the third distance mean value
- s is the first standard deviation
- the distance value range is μ 1 - 3s to μ 1 + 3s.
- data whose distance from the third distance mean μ 1 is greater than 3 standard deviations is a small-probability event, that is, an event that is unlikely to occur.
- if the distance value between a user feature vector and the center vector of the user class X falls outside this range, the electronic device may determine that the user feature vector is a difference feature vector.
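The range construction from the distance mean and standard deviation can be sketched as below. The source does not say whether the population or sample standard deviation is meant; the population form (`statistics.pstdev`) and Euclidean distance are assumptions here, and the function names and sample points are hypothetical.

```python
import math
import statistics

def distance_range(center, vectors, multiple=3.0):
    """Return (lower, upper) = mean -/+ multiple * standard deviation."""
    distances = [math.dist(center, v) for v in vectors]
    mean = statistics.fmean(distances)      # third distance mean
    std = statistics.pstdev(distances)      # population std: an assumption
    return mean - multiple * std, mean + multiple * std

def is_difference_vector(center, vector, bounds):
    """True when the vector's distance to the center is outside the range."""
    lower, upper = bounds
    d = math.dist(center, vector)
    return d < lower or d > upper

# Hypothetical class X: center at the origin, members at distances 1, 2, 3.
center_x = (0.0, 0.0)
members = [(1.0, 0.0), (0.0, 2.0), (3.0, 0.0)]
bounds = distance_range(center_x, members)
far_is_difference = is_difference_vector(center_x, (10.0, 0.0), bounds)
near_is_difference = is_difference_vector(center_x, (2.0, 0.0), bounds)
```

Because distances are never negative, the lower boundary only matters when the mean is more than three standard deviations above zero.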
- the distance values between the center vector of a user class and the user feature vectors included in the user class do not always conform to a normal distribution curve.
- the electronic device can determine the range of distance values in the following manner.
- the electronic device calculates a distance value between the center vector of the user class X and each user feature vector included in the user class X to obtain a plurality of distance values.
- the electronic device calculates a logarithmic value of each of the plurality of distance values according to a preset logarithmic function.
- the electronic device also calculates the mean of the multiple logarithms as a log-mean.
- the electronic device also calculates a standard deviation of a plurality of logarithms as a second standard deviation.
- the electronic device constructs a normal distribution curve based on the log mean and the second standard deviation. The normal distribution curve is used to characterize the logarithmic distribution of the distance value between the center vector of the user class X and the user feature vector included in the user class X.
- the electronic device determines the third boundary value and the fourth boundary value based on the log mean and the second standard deviation.
- the third boundary value is smaller than the log mean, and the absolute value of the difference between the third boundary value and the log mean is: a preset multiple of the second standard deviation.
- the fourth boundary value is greater than the log mean, and the absolute value of the difference between the fourth boundary value and the log mean is also: a preset multiple of the second standard deviation.
- the electronic device converts the third boundary value and the fourth boundary value back from the logarithmic scale, and determines the interval composed of the two resulting values as the range of distance values of the user class X.
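The logarithmic variant can be sketched in the same way. Several points here are assumptions rather than statements from the source: the natural logarithm as the preset logarithmic function, the population standard deviation, strictly positive distances, and mapping the two log-scale boundaries back with the exponential function to obtain the final range.

```python
import math
import statistics

def log_distance_range(center, vectors, multiple=3.0):
    """Bounds from the mean and std of log-distances, mapped back with exp."""
    # Assumes no vector coincides with the center (log(0) is undefined).
    logs = [math.log(math.dist(center, v)) for v in vectors]
    log_mean = statistics.fmean(logs)
    log_std = statistics.pstdev(logs)          # second standard deviation
    third = log_mean - multiple * log_std      # lower boundary, log scale
    fourth = log_mean + multiple * log_std     # upper boundary, log scale
    return math.exp(third), math.exp(fourth)

# Hypothetical class X: members at distances 1, e and e**2 from the center.
center_x = (0.0, 0.0)
members = [(1.0, 0.0), (0.0, math.e), (math.e ** 2, 0.0)]
log_bounds = log_distance_range(center_x, members)
```

Working on the log scale keeps the lower boundary positive, which suits skewed distance distributions better than a symmetric range around the mean.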
- the electronic device determines whether each user feature value corresponding to the difference feature vector exceeds a preset feature baseline value.
- For each user behavior dimension, the electronic device is pre-configured with a feature baseline value.
- If a user feature value exceeds the feature baseline value of its user behavior dimension, the electronic device may determine that the user behavior characterized in the user behavior dimension is an abnormal user behavior, and determine that the user represented by the difference feature vector is an abnormal user.
- Otherwise, the electronic device determines that the user behavior characterized in the user behavior dimension is a normal user behavior. If none of the user feature values corresponding to the difference feature vector exceeds its feature baseline value, the electronic device determines that the user represented by the difference feature vector is a normal user.
- the feature baseline value for user behavior dimension 1 is X 1
- the feature baseline value for user behavior dimension 2 is X 2
- the feature baseline value for user behavior dimension 3 is X 3
- the difference feature vector includes the user feature value 1 of the user behavior dimension 1, the user feature value 2 of the user behavior dimension 2, and the user feature value 3 of the user behavior dimension 3.
- If the user feature value 1 exceeds X 1 , the electronic device may determine that the user behavior characterized by the user behavior dimension 1 is an abnormal user behavior, and the user represented by the difference feature vector is an abnormal user.
- If the user feature value 2 exceeds X 2 , the electronic device may determine that the user behavior characterized by the user behavior dimension 2 is an abnormal user behavior, and the user represented by the difference feature vector is an abnormal user.
- If the user feature value 3 exceeds X 3 , the electronic device may determine that the user behavior characterized by the user behavior dimension 3 is an abnormal user behavior, and the user represented by the difference feature vector is an abnormal user.
- If the user feature value 1 does not exceed X 1 , the user feature value 2 does not exceed X 2 , and the user feature value 3 does not exceed X 3 , the electronic device may determine that the user represented by the difference feature vector is a normal user.
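The per-dimension baseline comparison can be expressed compactly. The sketch below is illustrative only: the baseline numbers standing in for X 1 , X 2 and X 3 and the feature values are hypothetical, and "exceeds" is read as strictly greater than.

```python
def abnormal_dimensions(feature_values, baselines):
    """Return the dimensions whose feature value exceeds its baseline."""
    return [
        dim for dim, value in feature_values.items()
        if value > baselines[dim]
    ]

# Hypothetical baselines X1..X3 and a hypothetical difference feature vector.
baselines = {"dimension 1": 2, "dimension 2": 50, "dimension 3": 100}
difference_vector = {"dimension 1": 5, "dimension 2": 30, "dimension 3": 80}

exceeded = abnormal_dimensions(difference_vector, baselines)
is_abnormal_user = bool(exceeded)   # abnormal if any dimension exceeds its baseline
```

Returning the list of exceeded dimensions, rather than a single flag, also shows which user behavior was judged abnormal.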
- For a user behavior dimension with a small difference in user feature values, for example, user behavior dimension 1, the electronic device may directly determine the feature baseline value of the user behavior dimension 1.
- For example, the frequency at which a user switches MAC addresses is generally 1 or 2 times in one day.
- the electronic device can then determine that the feature baseline value of the user behavior dimension of MAC address switching frequency is 2.
- For a user behavior dimension with a large difference in user feature values, for example, user behavior dimension 2, the electronic device counts the probability density distribution of the user feature values of the plurality of user behavior data under the user behavior dimension 2. The electronic device determines a feature baseline value of the user behavior dimension 2 based on the probability density distribution.
- in the probability density distribution shown in FIG. 6, the horizontal axis represents the user feature value and the vertical axis represents the cumulative probability.
- the rectangles in the coordinate system represent the probability density of the user feature values.
- the cumulative probability curve is obtained based on the probability density distribution.
- where the user feature value is less than 20 or greater than 120, the slope of the cumulative probability curve is much smaller than the average slope.
- the electronic device can therefore determine the feature baseline value of the user behavior dimension represented by FIG. 6: less than 20 or greater than 120.
- FIG. 7 is still another schematic flowchart of an abnormal user identification method according to an embodiment of the present disclosure, where the method includes the following steps.
- Step 701 The electronic device acquires multiple user behavior data of the user to be identified.
- the plurality of user behavior data includes at least one historical user behavior data and one current user behavior data.
- when it is required to detect whether the user to be identified is an abnormal user, the electronic device acquires multiple user behavior data of the user to be identified.
- the user to be identified is taken as an example for illustration and is not intended to be limiting.
- the electronic device can obtain a plurality of user behavior data of the user to be identified from the user behavior log.
- the user behavior log is used to record various network behaviors of the user.
- the electronic device may also acquire a plurality of user behavior data of the user to be identified from the user behavior data input by the user.
- Embodiments of the present disclosure do not limit the form in which an electronic device acquires user behavior data.
- the electronic device may acquire user behavior data of different users according to a preset time granularity. Among them, the electronic device can set different time granularities according to various requirements for abnormal user identification.
- the electronic device acquires multiple user behavior data of the user to be identified according to a preset time granularity.
- the preset time granularity of the electronic device is 10 minutes.
- the electronic device can acquire the user behavior data 31 of the user A1 in the time period 9:50-10:00, the user behavior data 32 of the user A1 in the time period 9:40-9:50, and the user behavior data 33 of the user A1 in the time period 9:30-9:40.
- the user behavior data 31 is the current user behavior data of the user A1.
- the user behavior data 32 and the user behavior data 33 and the like are historical user behavior data of the user A1.
- the preset time granularity of the electronic device is 10 minutes.
- the electronic device can acquire the user behavior data 41 of the user A1 in the time period 10:00-10:10, the user behavior data 42 of the user A1 in the time period 9:50-10:00, the user behavior data 43 of the user A1 in the time period 9:40-9:50, and the user behavior data 44 of the user A1 in the time period 9:30-9:40.
- the user behavior data 41 is the current user behavior data of the user A1.
- User behavior data 42, user behavior data 43, and user behavior data 44 are historical user behavior data of user A1.
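Splitting a behavior log into windows of the preset time granularity can be sketched as follows. The event format (timestamp, description), the function name `split_windows`, the calendar date and the half-open window boundaries are assumptions made for the illustration.

```python
from datetime import datetime, timedelta

def split_windows(events, now, granularity=timedelta(minutes=10), count=3):
    """Return `count` windows ending at `now`, newest first.

    Each window is (start, end, events_in_window). The first window holds the
    current user behavior data; the remaining windows hold historical data.
    """
    windows = []
    for i in range(count):
        end = now - i * granularity
        start = end - granularity
        inside = [e for e in events if start <= e[0] < end]
        windows.append((start, end, inside))
    return windows

# Hypothetical log entries for user A1 (the date is arbitrary).
day = datetime(2019, 5, 9)
events = [
    (day.replace(hour=9, minute=55), "IM file sent"),
    (day.replace(hour=9, minute=45), "IM message sent"),
    (day.replace(hour=9, minute=42), "IM message received"),
    (day.replace(hour=9, minute=31), "IM message sent"),
]
windows = split_windows(events, now=day.replace(hour=10, minute=0))
current, *historical = windows
```

With `now` set to 10:00, the three windows correspond to 9:50-10:00, 9:40-9:50 and 9:30-9:40.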
- Step 702 The electronic device extracts multiple first data feature values of each historical user behavior data in preset multiple user behavior dimensions, and extracts multiple second data feature values of the current user behavior data in the multiple user behavior dimensions.
- the user behavior dimension may be divided to obtain a service layer feature dimension and a behavior layer feature dimension.
- the electronic device can quickly extract data feature values under multiple user behavior dimensions.
- the electronic device obtains multiple user behavior dimensions by arbitrarily combining the content included in the business layer feature dimension and the behavior layer feature dimension.
- the user behavior dimension obtained by the electronic device includes, but is not limited to, the number of IM transmission information, the number of IM reception information, the number of IM transmission files, the size of the IM transmission file, and the like.
- the electronic device extracts a plurality of first data feature values of each historical user behavior data in the plurality of user behavior data, and extracts a plurality of current user behavior data in the plurality of user behavior data. Second data feature values.
- Step 703 The electronic device determines a first data feature vector of each historical user behavior data according to the plurality of first data feature values, and determines a second data feature vector of the current user behavior data according to the plurality of second data feature values.
- a historical user behavior data is taken as an example.
- the electronic device combines the plurality of first data feature values of the historical user behavior data to obtain a first data feature vector of the historical user behavior data.
- for the current user behavior data in the plurality of user behavior data, the electronic device combines the plurality of second data feature values of the current user behavior data to obtain a second data feature vector of the current user behavior data.
- the electronic device acquires the user behavior data 31 of the user A1, the user behavior data 32 of the user A1, and the user behavior data 33 of the user A1.
- the electronic device extracts from the user behavior data 31 that the number of IM transmission information is 10, the number of IM reception information is 8, the number of IM transmission files is 2, and the size of the IM transmission file is 500 KB.
- the electronic device extracts from the user behavior data 32 that the number of IM transmission information is 9, the number of IM reception information is 8, the number of IM transmission files is 3, and the size of the IM transmission file is 490 KB.
- the electronic device extracts from the user behavior data 33 that the number of IM transmission information is 10, the number of IM reception information is 7, the number of IM transmission files is 1, and the size of the IM transmission file is 600 KB.
- the electronic device may determine that the data feature vector 01 of the user behavior data 31 is {10, 8, 2, 500}, the data feature vector 02 of the user behavior data 32 is {9, 8, 3, 490}, and the data feature vector 03 of the user behavior data 33 is {10, 7, 1, 600}.
- the data feature vector 01 is a second data feature vector, and the data feature vector 02 and the data feature vector 03 are first data feature vectors.
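Building the data feature vectors from the extracted values can be sketched as below, using the numbers of the example above. The dictionary keys and the function name `to_feature_vector` are hypothetical; the only requirement illustrated is that every vector lists the user behavior dimensions in the same fixed order.

```python
# Fixed order of user behavior dimensions (key names are illustrative).
DIMENSIONS = (
    "im_messages_sent",
    "im_messages_received",
    "im_files_sent",
    "im_file_size_kb",
)

def to_feature_vector(record):
    """Combine the per-dimension feature values into one vector."""
    return tuple(record[d] for d in DIMENSIONS)

data_31 = {"im_messages_sent": 10, "im_messages_received": 8,
           "im_files_sent": 2, "im_file_size_kb": 500}
data_32 = {"im_messages_sent": 9, "im_messages_received": 8,
           "im_files_sent": 3, "im_file_size_kb": 490}
data_33 = {"im_messages_sent": 10, "im_messages_received": 7,
           "im_files_sent": 1, "im_file_size_kb": 600}

vector_01 = to_feature_vector(data_31)   # second data feature vector (current)
vector_02 = to_feature_vector(data_32)   # first data feature vector (historical)
vector_03 = to_feature_vector(data_33)   # first data feature vector (historical)
```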
- Step 704 The electronic device performs clustering processing on the plurality of first data feature vectors and the second data feature vector by using a preset clustering algorithm to obtain multiple data classes.
- the preset clustering algorithm may be a K-means clustering algorithm, a K-means Plus clustering algorithm, or the like.
- the electronic device performs clustering processing on the plurality of first data feature vectors and the second data feature vector by using a preset clustering algorithm to obtain a plurality of data classes.
- Each data class includes at least one data feature vector.
- the preset clustering algorithm is a K-means clustering algorithm.
- the electronic device clusters the plurality of first data feature vectors and the second data feature vector by K-means clustering algorithm to obtain K initial data classes. Where K is a positive integer.
- the electronic device takes the K initial data classes as K data classes.
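For readers unfamiliar with K-means, a plain version of the algorithm is sketched below. It is not the patent's implementation: seeding the centers with the first K vectors, the Euclidean distance and the iteration cap are assumptions, and the two large vectors in the sample data are hypothetical additions to the three vectors of the example.

```python
import math

def k_means(vectors, k, iterations=100):
    """Plain K-means; the first k vectors seed the centers (an assumption)."""
    centers = [tuple(v) for v in vectors[:k]]
    assignment = None
    for _ in range(iterations):
        # Assign every vector to its nearest center.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(v, centers[c]))
            for v in vectors
        ]
        if new_assignment == assignment:
            break  # converged: no vector changed class
        assignment = new_assignment
        # Move each center to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:  # keep the old center if a class became empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    classes = [[v for v, a in zip(vectors, assignment) if a == c] for c in range(k)]
    return classes, centers

vectors = [
    (10, 8, 2, 500), (9, 8, 3, 490), (10, 7, 1, 600),   # from the example
    (40, 35, 20, 5000), (42, 30, 18, 5200),             # hypothetical
]
data_classes, centers = k_means(vectors, k=2)
```

In this unscaled example the file-size dimension dominates the distance, so in practice the dimensions would likely need normalizing first; the source does not discuss this.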
- Step 705 The electronic device determines a first center vector of the first data class to which the second data feature vector belongs.
- the electronic device determines, from the plurality of data classes, the first data class to which the second data feature vector belongs, calculates an average value of the plurality of data feature vectors included in the first data class, and uses the average value as the center vector of the first data class, so as to determine whether the current user to be identified is an abnormal user.
- the center vector of the first data class is the first center vector.
- the first data class includes a data feature vector 01, a data feature vector 02, and a data feature vector 03.
- the electronic device calculates the mean t 2 of the data feature vector 01, the data feature vector 02, and the data feature vector 03, and determines the calculated mean t 2 as the first center vector of the first data class.
- Step 706 The electronic device determines a distance value between the second data feature vector and the first center vector.
- the data feature vector 01 is a second data feature vector
- the center vector of the first data class is t 2 .
- the electronic device calculates a distance value d a1 between the data feature vector 01 and the center vector t 2 .
- Step 707 If the distance value is not within the preset distance range, the electronic device determines that the user to be identified is an abnormal user.
- the electronic device determines a distance value between the second data feature vector and the first center vector, and determines whether the determined distance value is within a preset distance value range. If the distance value is not within the preset distance value range, the electronic device may determine that the second data feature vector is a difference feature vector, and determine that the user represented by the second data feature vector is an abnormal user, that is, determine that the user to be identified is an abnormal user.
- the electronic device is pre-set with a range of distance values.
- the distribution of data feature vectors is different.
- the electronic device may preset a range of distance values of the first data class.
- If the electronic device determines that the distance value is not within the preset distance value range, it determines that the user to be identified is an abnormal user. If the electronic device determines that the distance value is within the preset distance value range, it determines that the user to be identified is a normal user.
- the preset distance value ranges from d a01 to d a02 .
- the electronic device calculates a distance value d a1 between the data feature vector 01 and the center vector t 2 . If d a1 < d a01 or d a1 > d a02 , the electronic device may determine that the user to be identified is an abnormal user, that is, the electronic device may determine that the user A1 is an abnormal user.
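Steps 705 to 707 can be traced with the three example vectors. In this sketch the Euclidean distance and the preset range values standing in for d a01 and d a02 are assumptions; the function names are illustrative.

```python
import math

def center_vector(vectors):
    """Component-wise mean of the data feature vectors in a data class."""
    return tuple(sum(col) / len(vectors) for col in zip(*vectors))

def is_abnormal(current, data_class, distance_range):
    """True when the distance to the class center is outside the range."""
    lower, upper = distance_range
    d = math.dist(current, center_vector(data_class))
    return d < lower or d > upper

vector_01 = (10, 8, 2, 500)   # second data feature vector (current)
vector_02 = (9, 8, 3, 490)
vector_03 = (10, 7, 1, 600)
first_class = [vector_01, vector_02, vector_03]

t2 = center_vector(first_class)        # first center vector
d_a1 = math.dist(vector_01, t2)        # distance of the current vector

# Hypothetical preset range d_a01..d_a02.
abnormal = is_abnormal(vector_01, first_class, (0.0, 50.0))
```

Here the center works out to roughly (9.67, 7.67, 2, 530) and d a1 to about 30, so user A1 would be treated as normal under the assumed range of 0 to 50.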
- the electronic device performs clustering processing on the data feature vectors to obtain the first data class to which the current user behavior data belongs.
- the electronic device realizes the identification of the abnormal user according to the distance between the second data feature vector in the first data class and the center vector of the first data class.
- the administrator does not need to add the restricted user name to the blacklist, and the electronic device does not need to identify the abnormal user by establishing a blacklist.
- the abnormal user identification method provided by the embodiment of the present disclosure can identify abnormal users who are unknown to the administrator and whose abnormal behavior the administrator cannot discover.
- the electronic device stores a preset number threshold for limiting the number of data feature vectors included in the data class.
- the electronic device performs clustering processing on the plurality of first data feature vectors and the second data feature vector by using a preset clustering algorithm to obtain a plurality of data classes (step 704), which may include the following steps.
- the electronic device clusters the plurality of first data feature vectors and the second data feature vector by K-means clustering algorithm to obtain K initial data classes.
- the electronic device acquires a first initial data class of the K initial data classes.
- the first initial data class includes N data feature vectors, and N is a positive integer.
- the first initial data class is: the initial data class, among the K initial data classes, to which the second data feature vector belongs.
- the electronic device detects if N is less than a quantity threshold. If N is not less than the number threshold, the electronic device treats the K initial data classes as K data classes.
- If N is less than the number threshold, the electronic device acquires a second initial data class of the K initial data classes.
- the second initial data class is: an initial data class represented by a center vector having the smallest distance value from the center vector of the first initial data class among the K initial data classes.
- the electronic device combines the first initial data class and the second initial data class to obtain a merged initial data class.
- the electronic device uses the merged initial data class as a clustered data class, and uses the other initial data classes that are not merged in the K initial data classes as clustered data classes. In this way, the electronic device obtains multiple data classes.
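The threshold-based merge can be sketched as follows: when the class holding the current vector has fewer than the threshold number of vectors, it is merged with the class whose center is nearest. The function names, the Euclidean distance and the sample data are assumptions for illustration.

```python
import math

def center(cls):
    """Component-wise mean of the vectors in a class."""
    return tuple(sum(col) / len(cls) for col in zip(*cls))

def merge_if_small(classes, current_vector, threshold):
    """Merge the class holding `current_vector` into its nearest class
    when it contains fewer than `threshold` vectors."""
    first_idx = next(i for i, c in enumerate(classes) if current_vector in c)
    first = classes[first_idx]
    if len(first) >= threshold or len(classes) < 2:
        return classes  # N is not less than the threshold: keep the K classes
    first_center = center(first)
    second_idx = min(
        (i for i in range(len(classes)) if i != first_idx),
        key=lambda i: math.dist(first_center, center(classes[i])),
    )
    merged = first + classes[second_idx]
    rest = [c for i, c in enumerate(classes) if i not in (first_idx, second_idx)]
    return rest + [merged]

# Hypothetical initial data classes; the current vector sits alone in its class.
initial = [
    [(0.0, 0.0)],
    [(1.0, 0.0), (1.0, 1.0), (2.0, 0.0)],
    [(20.0, 20.0), (21.0, 20.0)],
]
merged_classes = merge_if_small(initial, (0.0, 0.0), threshold=2)
```

Merging an undersized class avoids computing a center and a distance range from too few data feature vectors.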
- the electronic device may perform the merge processing on the obtained multiple data classes by calculating the aggregated value of the data feature vector.
- the aggregated value is used to characterize the reasonable degree to which the data feature vector belongs to the data class.
- the electronic device can take the following steps to obtain an aggregate value.
- the electronic device calculates a first distance value between the third data feature vector and each of the fourth data feature vectors, respectively.
- the fourth data feature vector is: a data feature vector other than the third data feature vector included in the data class in which the third data feature vector is located.
- the electronic device performs average processing on the plurality of first distance values to obtain a first distance average value.
- the electronic device calculates a second distance value between the third data feature vector and each of the fifth data feature vectors.
- the fifth data feature vector is: a data feature vector included in each data class except the data class in which the third data feature vector is located.
- the electronic device performs average processing on a plurality of second distance values belonging to the same data class to obtain a plurality of second distance average values.
- the electronic device acquires a distance mean minimum value among the plurality of second distance averages.
- the electronic device calculates a ratio of the first distance mean value to the distance mean minimum value to obtain an aggregated value of the third data feature vector.
- the process of the electronic device performing the merge process on the obtained plurality of data classes may include the following steps.
- the electronic device calculates a distance value between the first center vector and the second center vector of each data class other than the first data class, to obtain a plurality of distance values.
- the second center vector is a center vector of any one of the plurality of data classes except the first data class.
- from the plurality of distance values obtained, the electronic device acquires a minimum distance value and determines a second data class characterized by the minimum distance value.
- the electronic device acquires a third aggregated value of the data feature vector included in each of the plurality of data classes.
- the electronic device can obtain a plurality of third aggregated values.
- the electronic device treats the first data class and the second data class as one merged data class, and acquires a fourth aggregated value of each data feature vector included in the merged data class and in each of the other data classes.
- the electronic device can obtain a plurality of fourth aggregated values.
- the electronic device accumulates the plurality of third aggregated values to obtain a third sum value.
- the electronic device performs an accumulation process on the plurality of fourth aggregated values to obtain a fourth sum value.
- the sum of the aggregated values of all the data feature vectors in the plurality of data classes is used to evaluate the quality of the clustering effect.
- if the fourth sum value is less than the third sum value, the electronic device determines that the clustering effect is better with the first data class and the second data class merged, and merges the first data class and the second data class.
- the electronic device then recalculates the distance between the center vector of the first data class and the center vector of each of the other data classes, determines the minimum distance value among the distance values obtained, determines the second data class characterized by that minimum distance value, and merges the first data class and the second data class; this repeats until the fourth sum value is not less than the third sum value.
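The merge procedure above can be sketched as follows, using the ratio form of the aggregated value (a smaller sum is treated as a better clustering effect). This is a minimal sketch under stated assumptions, not the disclosed implementation: all names are hypothetical, vectors without peers contribute 0, and merging stops once two classes remain so that the sum stays defined.

```python
from math import dist
from statistics import mean


def center(cls):
    """Component-wise mean of the vectors in a class."""
    return tuple(mean(col) for col in zip(*cls))


def aggregated_sum(classes):
    """Sum of the ratio-form aggregated values of all vectors."""
    total = 0.0
    for i, cls in enumerate(classes):
        others = classes[:i] + classes[i + 1:]
        for j, vec in enumerate(cls):
            peers = cls[:j] + cls[j + 1:]
            if not peers or not others:
                continue  # assumption: such vectors contribute 0
            first_mean = mean(dist(vec, p) for p in peers)
            min_mean = min(mean(dist(vec, q) for q in o) for o in others)
            total += first_mean / min_mean if min_mean else float("inf")
    return total


def merge_first_class(classes, first=0):
    """Merge class `first` with its nearest class while the sum decreases."""
    classes = [list(c) for c in classes]
    while len(classes) > 2:  # assumption: keep at least two classes
        c1 = center(classes[first])
        nearest = min(
            (i for i in range(len(classes)) if i != first),
            key=lambda i: dist(c1, center(classes[i])),
        )
        third_sum = aggregated_sum(classes)  # before merging
        merged = [
            classes[first] + classes[nearest] if i == first else c
            for i, c in enumerate(classes)
            if i != nearest
        ]
        fourth_sum = aggregated_sum(merged)  # after merging
        if fourth_sum >= third_sum:
            break  # merging no longer improves the clustering effect
        # Removing `nearest` shifts `first` left if it came earlier.
        first = first - 1 if nearest < first else first
        classes = merged
    return classes
```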
- the electronic device may also obtain the aggregated value by the following steps.
- the process of combining the obtained multiple data classes based on the aggregated value obtained by subtracting the ratio from 1 or the aggregated value obtained by subtracting 1 from the ratio may include the following steps.
- the electronic device calculates a distance value between the first center vector and the second center vector of each of the plurality of data classes other than the first data class, to obtain a plurality of distance values.
- the second center vector is a center vector of any one of the plurality of data classes except the first data class.
- the electronic device acquires the minimum distance value from the plurality of distance values obtained, and determines the second data class characterized by that minimum distance value.
- the electronic device acquires a third aggregated value of the data feature vector included in each of the plurality of data classes.
- the electronic device can obtain a plurality of third aggregated values.
- the electronic device treats the first data class and the second data class as one merged data class, and acquires a fourth aggregated value of each data feature vector included in the merged data class and in each of the other data classes.
- the electronic device can obtain a plurality of fourth aggregated values.
- the electronic device accumulates the plurality of third aggregated values to obtain a third sum value.
- the electronic device performs an accumulation process on the plurality of fourth aggregated values to obtain a fourth sum value.
- if the fourth sum value is greater than the third sum value, the electronic device determines that the clustering effect is better with the first data class and the second data class merged, and merges the first data class and the second data class.
- the electronic device then recalculates the distance between the center vector of the first data class and the center vector of each of the other data classes, determines the minimum distance value among the distance values obtained, determines the second data class characterized by that minimum distance value, and merges the first data class and the second data class; this repeats until the fourth sum value is not greater than the third sum value.
- the electronic device pre-stores a range of distance values of the first data class.
- the range of distance values is used to limit the distance between the data feature vector in the data class and the center vector of the data class.
- the electronic device can determine a range of distance values for the first data class in the following manner.
- the electronic device calculates a distance value between the first center vector and each of the data feature vectors included in the first data class to obtain a plurality of distance values.
- the electronic device calculates a distance average of a plurality of distance values as a third distance average.
- the electronic device also calculates the standard deviation of the plurality of distance values as the first standard deviation.
- the electronic device can construct a normal distribution curve according to the third distance mean and the first standard deviation. The normal distribution curve is used to characterize a distance value distribution between the first center vector and the data feature vector included in the first data class.
- the electronic device determines the first boundary value and the second boundary value according to the third distance mean and the first standard deviation.
- the first boundary value is smaller than the third distance mean value, and the absolute value of the difference between the first boundary value and the third distance mean value is the preset multiple of the first standard deviation.
- the second boundary value is greater than the third distance mean value, and the absolute value of the difference between the second boundary value and the third distance mean value is also the preset multiple of the first standard deviation.
- the electronic device determines the interval composed of the first boundary value and the second boundary value as the range of distance values of the first data class.
- the preset multiple is 3.
- the electronic device determines the range of distance values of the first data class based on 3 standard deviations, as shown in FIG. 5.
- μ1 is the third distance mean value
- s is the first standard deviation
- the distance value range is μ1 - 3s to μ1 + 3s.
- a distance value that differs from the third distance mean μ1 by more than 3 standard deviations is a small-probability event, that is, an event that is unlikely to occur. If the distance value between the second data feature vector and the first center vector is not within the range of distance values, the electronic device may determine that the user to be identified is an abnormal user.
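The range of distance values built from the mean and the standard deviation can be sketched as below. The names are hypothetical, and the use of the population standard deviation (`statistics.pstdev`) is an assumption, since the text does not say which estimator is meant.

```python
from math import dist
from statistics import mean, pstdev


def distance_range(center, vectors, multiple=3.0):
    """Return (first boundary, second boundary) = (mu - k*s, mu + k*s)."""
    distances = [dist(center, v) for v in vectors]
    mu = mean(distances)    # third distance mean
    s = pstdev(distances)   # first standard deviation (assumed population)
    return mu - multiple * s, mu + multiple * s


def outside_range(center, vector, value_range):
    """True when the vector's distance to the center is not in the range."""
    low, high = value_range
    return not (low <= dist(center, vector) <= high)
```

A vector for which `outside_range` returns `True` corresponds to the case in which the user to be identified may be determined to be abnormal.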
- the distance distribution of the first center vector and the data feature vector included in the first data class does not necessarily conform to a normal distribution.
- the electronic device may determine the range of distance values of the first data class in the following manner.
- the electronic device calculates a distance value between the first center vector and each of the data feature vectors included in the first data class to obtain a plurality of distance values.
- the electronic device calculates a logarithmic value of each of the plurality of distance values according to a preset logarithmic function.
- the electronic device also calculates the mean of the multiple logarithms as a log-mean.
- the electronic device also calculates the standard deviation of the plurality of logarithms obtained as the second standard deviation.
- the electronic device can construct a normal distribution curve according to the log mean and the second standard deviation.
- the normal distribution curve is used to characterize the distribution of the logarithms of the distance values between the first center vector and the data feature vectors in the first data class.
- the electronic device determines the third boundary value and the fourth boundary value based on the log mean and the second standard deviation.
- the third boundary value is smaller than the log mean value, and the absolute value of the difference between the third boundary value and the log mean value is the preset multiple of the second standard deviation.
- the fourth boundary value is greater than the log mean value, and the absolute value of the difference between the fourth boundary value and the log mean value is also the preset multiple of the second standard deviation.
- the electronic device determines the interval composed of the first object value and the second object value as the range of distance values of the first data class.
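A minimal sketch of the logarithmic variant follows. It rests on two assumptions that the text does not confirm: the preset logarithmic function is the natural logarithm, and the two endpoints of the final range are obtained by mapping the third and fourth boundary values back to distances with the inverse function. All names are hypothetical.

```python
from math import dist, exp, log
from statistics import mean, pstdev


def log_distance_range(center, vectors, multiple=3.0):
    """Range of distance values derived from the logarithms of the distances.

    Requires every distance to be non-zero, because log(0) is undefined.
    """
    logs = [log(dist(center, v)) for v in vectors]
    log_mean = mean(logs)
    second_std = pstdev(logs)  # assumed population standard deviation
    third_boundary = log_mean - multiple * second_std
    fourth_boundary = log_mean + multiple * second_std
    # Assumption: map the log-space boundaries back to distance values.
    return exp(third_boundary), exp(fourth_boundary)
```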
- the electronic device determines, for each user behavior dimension of the multiple user behavior dimensions, whether the data feature value corresponding to the second data feature vector exceeds the preset feature baseline value.
- for each user behavior dimension, the electronic device is pre-configured with a feature baseline value.
- the second data feature vector is a difference feature vector.
- if the data feature value in a user behavior dimension exceeds the feature baseline value, the electronic device may determine that the user behavior characterized in the user behavior dimension is an abnormal user behavior, and determine that the user to be identified is an abnormal user.
- otherwise, the electronic device may determine that the user behavior characterized in the user behavior dimension is normal user behavior. If none of the data feature values corresponding to the second data feature vector exceeds its feature baseline value, the electronic device determines that the user to be identified is a normal user.
- the electronic device may directly determine the feature baseline value of the user behavior dimension 1.
- the frequency of the user switching MAC address is generally 1 or 2 times in one day.
- the electronic device can determine that the characteristic baseline value of the user behavior dimension of the MAC address switching frequency is 2.
- for a user behavior dimension with a large difference in data feature values, for example, user behavior dimension 2, the electronic device computes the probability density distribution of the data feature values of the plurality of user behavior data under user behavior dimension 2, and determines the feature baseline value of user behavior dimension 2 based on that probability density distribution.
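One way to picture the baseline check is the sketch below. The text does not state how the baseline is read off the distribution; taking the smallest observed value whose empirical cumulative probability reaches a target (0.95 here) is purely an assumption, as are all names.

```python
def baseline_from_distribution(values, cumulative_probability=0.95):
    """Smallest observed value whose empirical cumulative probability
    reaches the target (an assumed rule, not taken from the source)."""
    ordered = sorted(values)
    for rank, value in enumerate(ordered, start=1):
        if rank / len(ordered) >= cumulative_probability:
            return value
    return ordered[-1]


def abnormal_dimensions(feature_values, baselines):
    """Names of the behavior dimensions whose value exceeds its baseline."""
    return [
        name for name, value in feature_values.items()
        if value > baselines[name]
    ]
```

For a stable dimension such as the MAC address switching frequency, the baseline can simply be set by hand (2 in the example above) and passed in the `baselines` mapping.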
- FIG. 8 is a schematic diagram of a first structure of an abnormal user identification apparatus according to an embodiment of the present disclosure.
- the apparatus includes: an obtaining unit 801, an extracting unit 802, a first determining unit 803, a clustering unit 804, a second determining unit 805, and a third determining unit 806.
- the obtaining unit 801 is configured to acquire user behavior data of the user
- the extracting unit 802 is configured to extract multiple feature values of the user behavior data under a plurality of preset behavioral dimensions
- a first determining unit 803 configured to determine, according to the multiple feature values, a feature vector corresponding to the user behavior data
- the clustering unit 804 is configured to perform clustering processing on the feature vector by using a preset clustering algorithm to obtain multiple aggregation classes, and obtain a center vector of each aggregation class;
- a second determining unit 805 configured to determine a difference feature vector, where a distance value between the difference feature vector and a center vector of the associated aggregation class is not within a preset distance value range;
- the third determining unit 806 is configured to determine the user represented by the difference feature vector as an abnormal user.
- the electronic device acquires the difference feature vector whose distance from the center vector of the aggregation class is not within the preset distance value by performing clustering processing on the feature vector.
- the electronic device realizes recognition of the abnormal user according to the acquired difference feature vector.
- the administrator does not need to add the restricted user name to the blacklist.
- the electronic device does not need to identify the abnormal user by establishing a blacklist. This makes it possible to identify users who are unknown to the administrator and whose abnormal behavior cannot otherwise be found.
- the user described above is a plurality of users.
- the obtaining unit 801 may be specifically configured to acquire user behavior data of multiple users.
- the extracting unit 802 may be specifically configured to extract a plurality of user feature values of the user behavior data of each of the plurality of users in a preset plurality of user behavior dimensions;
- the first determining unit 803 is specifically configured to determine a user feature vector of each of the multiple users according to the multiple user feature values of each of the plurality of users;
- the clustering unit 804 is specifically configured to perform clustering processing on user feature vectors of multiple users by using a preset clustering algorithm to obtain multiple user classes, and to determine the center vector of each user class of the multiple user classes according to the user feature vectors included in that user class.
- the clustering unit 804 can be specifically used to:
- K-means clustering algorithm is used to cluster the user feature vectors of multiple users to obtain K initial user classes; K is a positive integer;
- the initial user class and the other initial user classes that are not merged in the K initial user classes are merged as the user class after the clustering process, and multiple user classes are obtained;
- the first initial user class is: an initial user class in which the number of user feature vectors included in the K initial user classes is less than a preset number threshold;
- the second initial user class is: an initial user class represented by a center vector having the smallest distance value from the center vector of the first initial user class among the K initial user classes.
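The merging of a sparse initial user class into its nearest neighbour can be sketched as below. The names are hypothetical, and repeating the merge until no class falls below the threshold is an assumption of this sketch.

```python
from math import dist
from statistics import mean


def center(cls):
    """Component-wise mean of the user feature vectors in a class."""
    return tuple(mean(col) for col in zip(*cls))


def merge_small_classes(initial_classes, count_threshold):
    """Merge each class with fewer than `count_threshold` vectors into the
    class whose center vector is nearest to its own center vector."""
    classes = [list(c) for c in initial_classes]
    while len(classes) > 1:
        small = next(
            (i for i, c in enumerate(classes) if len(c) < count_threshold),
            None,
        )
        if small is None:
            break  # no first initial user class remains
        c1 = center(classes[small])
        nearest = min(
            (i for i in range(len(classes)) if i != small),
            key=lambda i: dist(c1, center(classes[i])),
        )
        classes[nearest].extend(classes[small])  # second initial user class
        del classes[small]
    return classes
```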
- the clustering unit 804 can also be used to:
- the first user class and the second user class are used as the merged user class, obtaining a second aggregated value of the user feature vector included in the merged user class, and acquiring each user class except the merged user class among the multiple user classes a second aggregated value of the included user feature vector;
- the aggregated value is used to characterize the reasonable degree to which the user feature vector belongs to the user class.
- the clustering unit 804 can also be used to:
- the second user feature vector is: a user feature vector other than the first user feature vector in the user class where the first user feature vector is located;
- the ratio of the first distance mean to the distance mean minimum is used as the aggregate value of the first user feature vector.
- the third determining unit 806 is specifically configured to:
- if the user feature value corresponding to the difference feature vector exceeds the feature baseline value, determine that the user behavior characterized in the user behavior dimension is an abnormal user behavior, and determine that the user represented by the difference feature vector is an abnormal user.
- the user is a single user
- the user behavior data may include at least one historical user behavior data of the user and a current user behavior data.
- the obtaining unit 801 may be specifically configured to acquire multiple user behavior data of the user to be identified, where the plurality of user behavior data includes: at least one historical user behavior data and one current user behavior data;
- the extracting unit 802 is specifically configured to extract a plurality of first data feature values of each historical user behavior data in the at least one historical user behavior data under a preset plurality of behavior dimensions, and to extract a plurality of second data feature values of the current user behavior data under the plurality of behavior dimensions;
- the first determining unit 803 is specifically configured to determine, according to the plurality of first data feature values, a first data feature vector of each historical user behavior data in the at least one historical user behavior data, and according to the plurality of second data feature values Determining a second data feature vector of the current user behavior data;
- the clustering unit 804 is specifically configured to perform clustering processing on the plurality of first data feature vectors and the second data feature vector by using a preset clustering algorithm to obtain multiple data classes, and to determine the center vector of the first data class to which the second data feature vector belongs;
- a second determining unit 805 configured to determine whether a distance value between the second data feature vector and a center vector of the first data class is within a preset distance value range; if not, determining that the second data feature vector is a difference feature vector.
- the clustering unit 804 can be specifically used to:
- the K-means clustering algorithm performs clustering processing on the plurality of first data feature vectors and the second data feature vector to obtain K initial data classes; K is a positive integer;
- the first initial data class includes N data feature vectors, and N is a positive integer;
- the initial data class and the other initial data classes that are not merged in the K initial data classes are combined as the data class after the clustering process, and multiple data classes are obtained;
- the first initial data class is: an initial data class to which the second data feature vector belongs;
- the second initial data class is: an initial data class represented by a center vector having the smallest distance value from a center vector of the first initial data class among the K initial data classes.
- the clustering unit 804 can also be used to:
- the first data class and the second data class are used as the merged data class, obtaining a fourth aggregated value of the data feature vector included in the merged data class, and acquiring each of the plurality of data classes except the merged data class includes a fourth aggregated value of the data feature vector;
- the aggregated value is used to characterize the reasonable degree to which the data feature vector belongs to the data class.
- the clustering unit 804 can also be used to:
- the fourth data feature vector is: a data feature vector other than the third data feature vector in the data class in which the third data feature vector is located;
- the fifth data feature vector is: a data feature vector in each data class except the data class in which the third data feature vector is located;
- the ratio of the first distance mean to the distance mean minimum is used as the aggregate value of the third data feature vector.
- the third determining unit 806 is specifically configured to:
- if the data feature value corresponding to the second data feature vector exceeds the feature baseline value, determine that the user behavior characterized in the user behavior dimension is an abnormal user behavior, and determine that the user to be identified is an abnormal user.
- the embodiment of the present disclosure further provides an electronic device, as shown in FIG. 9, including a processor 901 and a machine readable storage medium 902, where the machine readable storage medium 902 stores machine executable instructions executable by the processor 901.
- the machine executable instructions cause the processor 901 to:
- the feature vector is clustered by a preset clustering algorithm to obtain multiple aggregation classes, and the center vector of each aggregation class is obtained;
- the user characterized by the difference feature vector is determined to be an abnormal user.
- the electronic device acquires a difference feature vector whose distance from the center vector of the aggregation class is not within the preset distance value by performing clustering processing on the feature vector.
- the electronic device realizes recognition of the abnormal user according to the acquired difference feature vector.
- the administrator does not need to add the restricted user name to the blacklist.
- the electronic device does not need to identify the abnormal user by establishing a blacklist. This makes it possible to identify users who are unknown to the administrator and whose abnormal behavior cannot otherwise be found.
- the machine executable instructions may specifically cause the processor 901 to:
- the user feature vector of multiple users is clustered by a preset clustering algorithm to obtain multiple user classes
- a center vector of each of the plurality of user classes is determined according to a user feature vector included in each of the plurality of user classes.
- machine executable instructions may specifically cause the processor 901 to:
- K-means clustering algorithm is used to cluster the user feature vectors of multiple users to obtain K initial user classes; K is a positive integer;
- the initial user class and the other initial user classes that are not merged in the K initial user classes are merged as the user class after the clustering process, and multiple user classes are obtained;
- the first initial user class is an initial user class in which the number of user feature vectors included in the K initial user classes is less than a preset number threshold;
- the second initial user class is an initial user class characterized by a center vector of the K initial user classes that has the smallest distance value from the center vector of the first initial user class.
- machine executable instructions may also cause the processor 901 to:
- the first user class and the second user class are used as the merged user class, obtaining a second aggregated value of the user feature vector included in the merged user class, and acquiring each user class except the merged user class among the multiple user classes a second aggregated value of the included user feature vector;
- the aggregated value is used to characterize the reasonable degree to which the user feature vector belongs to the user class.
- machine executable instructions may also cause the processor 901 to:
- the second user feature vector is: a user feature vector other than the first user feature vector in the user class where the first user feature vector is located;
- the third user feature vector is: a user feature vector in each user class except the user class in which the first user feature vector is located;
- the ratio of the first distance mean to the distance mean minimum is used as the aggregate value of the first user feature vector.
- machine executable instructions may specifically cause the processor 901 to:
- if the user feature value corresponding to the difference feature vector exceeds the feature baseline value, determine that the user behavior characterized in the user behavior dimension is an abnormal user behavior, and determine that the user represented by the difference feature vector is an abnormal user.
- when the user is a single user, the user behavior data includes: at least one historical user behavior data of the user and one current user behavior data;
- the machine executable instructions may specifically cause the processor 901 to:
- the K-means clustering algorithm performs clustering processing on the plurality of first data feature vectors and the second data feature vector to obtain K initial data classes; K is a positive integer;
- the first initial data class includes N data feature vectors, and N is a positive integer;
- the initial data class and the other initial data classes that are not merged in the K initial data classes are combined as the data class after the clustering process, and multiple data classes are obtained;
- the first initial data class is: an initial data class to which the second data feature vector belongs;
- the second initial data class is: an initial data class represented by a center vector having the smallest distance value from the center vector of the first initial data class among the K initial data classes.
- machine executable instructions may also cause the processor 901 to:
- the first data class and the second data class are used as the merged data class, obtaining a fourth aggregated value of the data feature vector included in the merged data class, and acquiring each of the plurality of data classes except the merged data class includes a fourth aggregated value of the data feature vector;
- the aggregated value is used to characterize the reasonable degree to which the data feature vector belongs to the data class.
- machine executable instructions may also cause the processor 901 to:
- the fourth data feature vector is: a data feature vector other than the third data feature vector in the data class in which the third data feature vector is located;
- the fifth data feature vector is: a data feature vector in each data class except the data class in which the third data feature vector is located;
- the ratio of the first distance mean to the distance mean minimum is used as the aggregate value of the third data feature vector.
- machine executable instructions may specifically cause the processor 901 to:
- if the data feature value corresponding to the second data feature vector exceeds the feature baseline value, determine that the user behavior characterized in the user behavior dimension is an abnormal user behavior, and determine that the user to be identified is an abnormal user.
- the electronic device may further include: a communication interface 903 and a communication bus 904; wherein the processor 901, the machine readable storage medium 902, and the communication interface 903 complete communication with each other through the communication bus 904, and the communication interface 903 is used for communication between the above electronic device and other devices.
- the communication bus may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus.
- the communication bus can be divided into an address bus, a data bus, a control bus, and the like.
- the machine readable storage medium may include a random access memory (RAM), and may also include a non-volatile memory (NVM), such as at least one disk storage. Additionally, the machine readable storage medium can also be at least one storage device located remotely from the aforementioned processor.
- the processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components.
- an embodiment of the present disclosure further provides a machine readable storage medium storing machine executable instructions that, when invoked and executed by a processor, cause the processor to implement any of the abnormal user identification method steps shown in FIGS. 1-7 above.
- the embodiment of the present disclosure further provides machine executable instructions that, when invoked and executed by a processor, cause the processor to implement any of the abnormal user identification method steps shown in FIGS. 1-7 above.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- Computer Security & Cryptography (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Computer Hardware Design (AREA)
- Software Systems (AREA)
- Computing Systems (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Probability & Statistics with Applications (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
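User behavior data of multiple users is acquired; multiple user feature values of each user's user behavior data are extracted under multiple preset user behavior dimensions; a user feature vector of each user is determined from that user's multiple user feature values; the user feature vectors of the multiple users are clustered by a preset clustering algorithm to obtain multiple user classes; a center vector of each user class is determined from the user feature vectors included in that user class; a difference feature vector of each user class is acquired, where a difference feature vector is a user feature vector in a user class whose distance value from the center vector of the user class is not within a preset distance value range; and the user characterized by the difference feature vector is determined to be an abnormal user.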
Description
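Cross-Reference to Related Applications
This disclosure claims priority to Chinese Patent Application No. 201810457994.8, filed with the Chinese Patent Office on May 14, 2018 and entitled "Abnormal user identification method and apparatus", the entire contents of which are incorporated herein by reference.
In a network system, security devices are usually deployed at the edge routers connecting the internal network and the external network, so that the hardware, software, and data in the network system are better protected and the network system runs continuously and reliably. The security device screens and filters packets sent from the internal network and packets arriving from the external network to keep the network system secure.
At present, the unpredictability of user behavior makes detecting abnormal users complicated, for example when detecting different kinds of operations performed by users at different times and in different locations. In one scenario, a user frequently sends and receives emails, opens illegal web pages, downloads illegal videos, and so on.
Detecting such users requires detecting users who frequently send and receive emails, users who frequently open illegal web pages, users who frequently download illegal videos, and so on.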
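Brief Description of the Drawings
FIG. 1 is a schematic flowchart of an abnormal user identification method provided by an embodiment of the present disclosure;
FIG. 2 is another schematic flowchart of an abnormal user identification method provided by an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a feature system provided by an embodiment of the present disclosure;
FIG. 4 is a distribution diagram of user classes provided by an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a normal distribution curve provided by an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of a cumulative probability curve provided by an embodiment of the present disclosure;
FIG. 7 is yet another schematic flowchart of an abnormal user identification method provided by an embodiment of the present disclosure;
FIG. 8 is a schematic structural diagram of an abnormal user identification apparatus provided by an embodiment of the present disclosure;
FIG. 9 is a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure.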
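The technical solutions in the embodiments of the present disclosure are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present disclosure. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present disclosure without creative effort fall within the protection scope of the present disclosure.
At present, abnormal users on an internal network can be identified by setting a blacklist. Specifically, an administrator adds the user names to be restricted to the blacklist. However, a blacklist can only identify abnormal users already known to the administrator; it cannot identify users who are unknown to the administrator and whose abnormal behavior cannot be discovered.
To address this problem, an embodiment of the present disclosure provides an abnormal user identification method. The method can be applied to electronic devices such as servers, computers, mobile phones, and security devices. For ease of description, the following uses an electronic device as the executing entity.
Specifically, refer to FIG. 1, which is a schematic flowchart of the abnormal user identification method provided by an embodiment of the present disclosure. The method includes the following steps.
Step 101: the electronic device acquires user behavior data of a user.
In this embodiment, the electronic device may acquire user behavior data of multiple users, or multiple pieces of user behavior data of one user. If the electronic device acquires multiple pieces of user behavior data of one user, they include at least one piece of historical user behavior data and one piece of current user behavior data.
In this embodiment, when abnormal users need to be detected, the electronic device acquires the user behavior data of the user.
The electronic device may acquire the user behavior data from a user behavior log, which records the user's various network behaviors. The electronic device may also acquire it from user behavior data entered by a user. The embodiments of the present disclosure do not limit the form in which the electronic device acquires user behavior data.
In one embodiment, the electronic device may set different time granularities according to the various needs of abnormal user identification, and acquires the user behavior data of the user within the preset time granularity.
Step 102: the electronic device extracts multiple feature values of the user behavior data under multiple preset behavior dimensions.
Specifically, to make it easier for the electronic device to extract feature values of the user behavior data under multiple behavior dimensions, the behavior dimensions may be divided into service-layer feature dimensions and behavior-layer feature dimensions. These two layers let the electronic device extract feature values under multiple behavior dimensions quickly.
FIG. 3 shows the behavior dimensions. The service-layer feature dimensions may include instant messaging (IM), web browsing, community forums, traffic, file transfer, email, and so on. The behavior-layer feature dimensions may include sending messages, receiving messages, sending files, File Transfer Protocol (FTP) traffic, Hyper Text Transfer Protocol over Secure Socket Layer (HTTPS) traffic, receiving emails, and so on.
The electronic device obtains multiple behavior dimensions by combining the contents of the two layers of feature dimensions in any way. As shown in FIG. 3, in one example the behavior dimensions obtained include, but are not limited to, the number of IM messages sent, the number of IM messages received, the number of IM files sent, and the size of IM files sent.
The electronic device then extracts multiple feature values under the multiple behavior dimensions.
Step 103: the electronic device determines, from the multiple feature values, the feature vector corresponding to the user behavior data.
Take one piece of user behavior data as an example. The electronic device combines the multiple feature values corresponding to this piece of user behavior data to obtain its feature vector.
Step 104: the feature vectors are clustered by a preset clustering algorithm to obtain multiple aggregation classes, and the center vector of each aggregation class is acquired.
In this embodiment, the preset clustering algorithm may be the K-means clustering algorithm, the K-means Plus clustering algorithm, or the like. The electronic device clusters the feature vectors by the preset clustering algorithm to obtain multiple aggregation classes. Each aggregation class includes at least one feature vector.
Take one aggregation class as an example. The electronic device calculates the mean of the feature vectors included in this aggregation class and uses that mean as the center vector of the aggregation class.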
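The center vector of an aggregation class is the component-wise mean of its feature vectors. A minimal sketch, with a hypothetical function name:

```python
from statistics import mean


def center_vector(feature_vectors):
    """Component-wise mean of the feature vectors in one aggregation class."""
    return [mean(column) for column in zip(*feature_vectors)]
```

For example, the class containing `[1, 2]` and `[3, 4]` has the center vector `[2, 3]`.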
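Step 105: the electronic device determines a difference feature vector, that is, a feature vector whose distance value from the center vector of the aggregation class to which it belongs is not within a preset distance value range.
In this embodiment, the preset distance value range has been stored in the electronic device in advance.
Specifically, a distance value that is not within the preset distance value range means that the distance value between a feature vector in an aggregation class and the center vector of that aggregation class is smaller than the minimum of the range or larger than the maximum of the range.
In either case, the electronic device determines that the feature vector is a difference feature vector.
Take one aggregation class as an example. The electronic device calculates the distance value between each feature vector included in this aggregation class and the center vector of the class. After obtaining the distance values, the electronic device sorts them, acquires the distance values that are not within the preset distance value range, and takes the feature vectors characterized by those distance values as difference feature vectors.
Step 106: the electronic device determines the user characterized by the difference feature vector to be an abnormal user.
For example, the electronic device extracts multiple feature values of user behavior data P1 of user Q1 under the preset behavior dimensions and, from the extracted feature values, determines feature vector 111 corresponding to user behavior data P1. If the electronic device determines that feature vector 111 is a difference feature vector, it determines that user Q1 is an abnormal user.
In the abnormal user identification method provided by this embodiment, the electronic device clusters the feature vectors and acquires the difference feature vectors whose distance from the center vector of their aggregation class is not within the preset distance value range. The electronic device identifies abnormal users from the acquired difference feature vectors. The administrator no longer needs to add restricted user names to a blacklist, and the electronic device no longer needs to build a blacklist to identify abnormal users, so users who are unknown to the administrator and whose abnormal behavior cannot be discovered can be identified.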
在一种实现方式中,电子设备获取的用户行为数据为多个用户的用户行为数据,本公开实施例提供了一种异常用户识别方法。参考图2,图2为本公开实施例提供的异常用户识别方法的另一种流程示意图,该方法包括如下步骤。
步骤201:电子设备获取多个用户的用户行为数据。
本公开实施例中,当需要检测异常用户时,电子设备获取多个用户的用户行为数据。
电子设备可以从用户行为日志中,获取到多个用户的用户行为数据。这里,用户行为日志用于记录用户的各种网络行为。另外,电子设备也可以从用户输入的用户行为数据中获取到多个用户的用户行为数据。本公开实施例不限定电子设备获取用户行为数据的形式。
在本公开实施例中,电子设备可以根据预先设置的时间粒度,获取不同用户的用户行为数据。其中,电子设备可根据对异常用户识别的多种需求,设置不同的时间粒度。
例如,对存在长期经营和策划的高级持续性威胁(英文:Advanced Persistent Threat,简称:APT)的用户进行识别时,电子设备可以预先设置较大的时间粒度。如,电子设备预先设置时间粒度可以为:一周、一个月等。
再例如,对在离职前突发攻击行为的用户进行识别时,电子设备可以预先设置较小的时间粒度。如,电子设备预先设置时间粒度可以为:10分钟,1小时,24小时等。
在预设的时间粒度内,电子设备获取多个用户的用户行为数据。
在一个示例中,假设当前时间为10:00,电子设备预设的时间粒度为10分钟,待识别的用户包括A、B和C。在10:00-10:10所表示的时间段内,电子设备可获取用户A的用户行为数据11、用户B的用户行为数据12和用户C的用户行为数据13。在9:50-10:00所表示的时间段内,电子设备也可获取用户A的用户行为数据21、用户B的用户行为数据22和用户C的用户行为数据23。
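The 10-minute example above amounts to bucketing log records by user and by time window. The following is a minimal sketch of that idea, assuming each log record is a hypothetical `(user, timestamp, action)` tuple; the record layout and the function names are illustrative and do not come from the disclosure.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def window_start(ts, granularity):
    """Round a timestamp down to the start of its time window."""
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    elapsed = ts - midnight
    return midnight + (elapsed // granularity) * granularity

def group_by_user_and_window(events, granularity):
    """Group (user, timestamp, action) records by user and time window."""
    grouped = defaultdict(list)
    for user, ts, action in events:
        grouped[(user, window_start(ts, granularity))].append(action)
    return dict(grouped)

# Hypothetical log records for users A and B.
events = [
    ("A", datetime(2019, 5, 9, 9, 52), "im_send"),
    ("A", datetime(2019, 5, 9, 10, 3), "im_send"),
    ("B", datetime(2019, 5, 9, 10, 9), "im_recv"),
]
buckets = group_by_user_and_window(events, timedelta(minutes=10))
```

Each key of `buckets` then corresponds to one piece of user behavior data, for example the data of user A in the 10:00-10:10 window.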
步骤202:电子设备提取每个用户的用户行为数据在预设的多个用户行为维度下的多个用户特征值。
具体地,为了便于电子设备提取每个用户的用户行为数据在多个用户行为维度下的用户特征值,可对用户行为维度进行划分,得到业务层特征维度和行为层特征维度。通过业务层特征维度和行为层特征维度,可使电子设备快速地在多个用户行为维度下提取用户特征值。
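The user behavior dimensions are shown in Figure 3. The service-layer feature dimensions may include IM, web browsing, community forums, traffic, file transfer, mail and so on. The behavior-layer feature dimensions may include sending messages, receiving messages, sending files, FTP traffic, HTTPS traffic, receiving mail and so on.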
电子设备通过将前述两层特征维度包括的内容进行任意组合,得到多个用户行为维度。如图3所示,在一个示例中,电子设备得到的用户行为维度包括但不限于:IM发送信息数、IM接收信息数、IM发送文件数、IM发送文件大小等。
进而,在多个用户行为维度下,电子设备提取到多个用户中每个用户的多个用户特征值。
步骤203:电子设备根据多个用户中每个用户的多个用户特征值,确定多个用户中每个用户的用户特征向量。
对于多个用户中的每个用户,以一个用户为例说明。电子设备对这一个用户的多个用户特征值进行组合处理,得到这一个用户的用户特征向量。
具体地,按照前述步骤的示例,电子设备获取到用户A的用户行为数据11、用户B的用户行为数据12和用户C的用户行为数据13。
电子设备从用户行为数据11中提取到IM发送信息数为10、IM接收信息数为8、IM发送文件数为2、IM发送文件大小为500KB。
电子设备从用户行为数据12中提取到IM发送信息数为9、IM接收信息数为8、IM发送文件数为3、IM发送文件大小为490KB。
电子设备从用户行为数据13中提取到IM发送信息数为10、IM接收信息数为7、IM发送文件数为1、IM发送文件大小为600KB。
此时,电子设备可确定每个用户的用户特征向量为:用户A的用户特征向量01为{10,8,2,500},用户B的用户特征向量02为{9,8,3,490},用户C的用户特征向量03为{10,7,1,600}。
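Combining feature values into a feature vector, as in the example for users A, B and C, can be sketched as follows. The dimension keys and the helper name `to_feature_vector` are hypothetical; only the numbers come from the example above.

```python
# Fixed ordering of the user behavior dimensions (names are illustrative).
DIMENSIONS = [
    "im_sent_messages",
    "im_received_messages",
    "im_sent_files",
    "im_sent_file_size_kb",
]

def to_feature_vector(feature_values, dimensions=DIMENSIONS):
    """Arrange the feature values of one user in a fixed dimension order."""
    return [float(feature_values[d]) for d in dimensions]

behaviour = {
    "A": {"im_sent_messages": 10, "im_received_messages": 8,
          "im_sent_files": 2, "im_sent_file_size_kb": 500},
    "B": {"im_sent_messages": 9, "im_received_messages": 8,
          "im_sent_files": 3, "im_sent_file_size_kb": 490},
    "C": {"im_sent_messages": 10, "im_received_messages": 7,
          "im_sent_files": 1, "im_sent_file_size_kb": 600},
}
vectors = {user: to_feature_vector(values) for user, values in behaviour.items()}
```

Keeping the dimension order fixed matters, because the later distance computations compare vectors component by component.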
步骤204:电子设备通过预设的聚类算法,对多个用户的用户特征向量进行聚类处理,得到多个用户类。
本公开实施例中,预设的聚类算法可以为K-means聚类算法、K-means Plus聚类算法等。电子设备通过预设的聚类算法,对多个用户的用户特征向量进行聚类处理,得到多个用户类。每个用户类中包括至少一个用户特征向量。
在一个示例中,预设的聚类算法为K-means聚类算法。电子设备通过K-means聚类算法,对多个用户的用户特征向量进行聚类处理,得到K个初始用户类。其中,K为正整数。电子设备将这K个初始用户类作为K个用户类。
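As an illustration of the K-means step, the sketch below implements a plain Lloyd iteration in pure Python. It is not the disclosure's implementation. Seeding the centers with the first K vectors, the iteration cap and the Euclidean distance are all assumptions made for the sketch.

```python
import math

def mean_vector(vectors):
    """Component-wise mean of a list of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def kmeans(vectors, k, iterations=100):
    """Assign each vector to one of k classes; returns (assignment, centres)."""
    centres = [list(v) for v in vectors[:k]]  # assumed seeding: first k vectors
    assignment = None
    for _ in range(iterations):
        new_assignment = [
            min(range(k), key=lambda c: math.dist(v, centres[c]))
            for v in vectors
        ]
        if new_assignment == assignment:  # converged
            break
        assignment = new_assignment
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:  # keep the previous centre if a class becomes empty
                centres[c] = mean_vector(members)
    return assignment, centres

points = [[0, 0], [10, 10], [0, 1], [10, 11]]
assignment, centres = kmeans(points, k=2)
```

With raw feature values, a dimension with a large numeric scale (such as file size in KB) can dominate the distance. The disclosure does not discuss scaling, so whether to normalize the dimensions first is left open here.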
步骤205:电子设备根据多个用户类中每个用户类包括的用户特征向量,确定多个用户类中每个用户类的中心向量。
对于每个用户类,以一个用户类为例说明。电子设备计算这一个用户类包括的多个用户特征向量的均值,将该均值作为这一个用户类的中心向量。
按照前述步骤的示例,电子设备对用户特征向量进行聚类处理后,得到多个用户类。假设多个用户类包括用户类1,用户类1包括用户A的用户特征向量01、用户B的用户特征向量02和用户C的用户特征向量03。
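The electronic device computes the mean $t_1$ of user feature vector 01, user feature vector 02 and user feature vector 03, and takes $t_1$ as the center vector of user class 1.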
步骤206:电子设备获取多个用户类中每个用户类的差异特征向量。
本公开实施例中,差异特征向量为:用户类中与用户类的中心向量的距离值未在预设距离值范围内的用户特征向量,即为差异特征向量与所属用户类的中心向量之间的距离值未在预设距离值范围内。预设距离值范围在先已存储至电子设备中。
具体地,用户类中与用户类的中心向量的距离值未在预设距离值范围内具体是指:用户类中用户特征向量与用户类的中心向量的距离值小于预设距离值范围的最小值;或用户类中用户特征向量与用户类的中心向量的距离值大于预设距离值范围的最大值。
可以理解的是,出现前述用户类中用户特征向量与用户类的中心向量的距离值小于预设距离值范围的最小值的情况,或出现前述用户类中用户特征向量与用户类的中心向量的距离值大于预设距离值范围的最大值的情况,电子设备确定用户特征向量为差异特征向量。
对于每个用户类,以一个用户类为例说明。电子设备计算这一个用户类包括的每个用户特征向量分别与这一个用户类的中心向量之间的距离值。电子设备得到多个距离值后,对这多个距离值进行排序。电子设备获取未在预设距离值范围内的距离值,并将获取的距离值所表征的用户特征向量作为差异特征向量。
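Continuing the example from the preceding steps, suppose the preset distance range is $d_1 \sim d_2$. User class 1 contains user feature vector 01 of user A, user feature vector 02 of user B and user feature vector 03 of user C, and the center vector of user class 1 is $t_1$. The distance between user feature vector 01 and $t_1$ is $d_{01}$, the distance between user feature vector 02 and $t_1$ is $d_{02}$, and the distance between user feature vector 03 and $t_1$ is $d_{03}$. If $d_{01} < d_1$, $d_1 < d_{02} < d_2$ and $d_1 < d_{03} < d_2$, the electronic device determines user feature vector 01, which $d_{01}$ represents, to be a difference feature vector.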
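The worked example for user class 1 can be reproduced numerically with the vectors given earlier for users A, B and C. This is a sketch only: the Euclidean distance and the range `(35.0, 80.0)` standing in for $d_1 \sim d_2$ are assumptions, since the disclosure fixes neither the distance metric nor concrete range values.

```python
import math

def centre_vector(vectors):
    """Component-wise mean of the vectors in one user class."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def difference_vectors(user_class, distance_range):
    """Return the users whose distance to the class centre is outside the range."""
    low, high = distance_range
    centre = centre_vector(list(user_class.values()))
    flagged = []
    for user, vector in user_class.items():
        distance = math.dist(vector, centre)
        if distance < low or distance > high:
            flagged.append(user)
    return flagged

user_class_1 = {
    "A": [10, 8, 2, 500],   # user feature vector 01
    "B": [9, 8, 3, 490],    # user feature vector 02
    "C": [10, 7, 1, 600],   # user feature vector 03
}
# (35.0, 80.0) is a hypothetical range standing in for d1 ~ d2.
flagged = difference_vectors(user_class_1, (35.0, 80.0))
```

With these numbers the distances to the center are roughly 30.0, 40.0 and 70.0, so only user A falls below the assumed lower bound, matching the outcome described in the example.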
在不同的用户类中,用户特征向量的分布不同。本公开实施例中,为了提高电子设备获取差异特征向量的准确性,电子设备中可分别存储每个用户类的预设距离值范围。
步骤207:电子设备将差异特征向量所表征的用户确定为异常用户。
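Specifically, continuing the example from the preceding steps, once the electronic device has determined user feature vector 01 to be a difference feature vector, it determines the user represented by user feature vector 01 to be an abnormal user; that is, user A is determined to be an abnormal user.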
因此,本公开实施例提供的技术方案中,电子设备通过对用户特征向量进行聚类处理,得到用户类中的差异特征向量。电子设备根据差异特征向量实现对异常用户的识别。管理人员无需再将限制的用户名加入黑名单中,电子设备也无需再通过建立黑名单的方式对异常用户进行识别。本公开实施例提供的异常用户识别方法,实现了对管理人员未知且无法发现异常行为的用户进行识别。
可选地,在一种实现方式中,为了避免用户类包括的用户特征向量的个数存在过少的情况,导致聚类效果不理想,异常用户识别不准确。电子设备存储了预先设置的数量阈值,数量阈值用于对用户类包括的用户特征向量的个数进行限制。电子设备通过预设的聚类算法,对多个用户的用户特征向量进行聚类处理,得到多个用户类(步骤204),可包括如下步骤。
电子设备通过K-means聚类算法,对多个用户的用户特征向量进行聚类处理,得到K个初始用户类。
电子设备检测K个初始用户类中是否存在包括用户特征向量的个数小于数量阈值的初始用户类。如果不存在包括用户特征向量的个数小于数量阈值的初始用户类,则电子设备将这K个初始用户类作为K个用户类。
如果存在包括用户特征向量的个数小于数量阈值的初始用户类,则电子设备获取K个初始用户类中的第一初始用户类和第二初始用户类。
在本公开实施例中,第一初始用户类为:K个初始用户类中,包括用户特征向量的个数小于预设数量阈值的初始用户类。第二初始用户类为:K个初始用户类中,与第一初始用户类的中心向量的距离值最小的中心向量所表征的初始用户类。
之后,电子设备对第一初始用户类与第二初始用户类进行合并处理,得到合并初始用户类。
电子设备将合并初始用户类作为聚类处理后的用户类,并将K个初始用户类中未合并的其他初始用户类作为聚类处理后的用户类。进而,电子设备得到多个用户类。
例如,预设的数量阈值为10。电子设备通过K-means聚类算法,对多个用户的用户特征向量进行聚类处理,得到5个初始用户类。例如,初始用户类1,初始用户类2,初始用户类3,初始用户类4和初始用户类5。其中,初始用户类1中包括8个用户特征向量,初始用户类2中包括12个用户特征向量,初始用户类3中包括11个用户特征向量,初始用户类4中包括15个用户特征向量,初始用户类5中包括17个用户特征向量。
可见,8<10,也就是,初始用户类1包括的用户特征向量的个数小于数量阈值,初始用户类1为第一初始用户类。
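The electronic device computes the distance $d_{11}$ between the center vectors of initial user class 2 and initial user class 1, the distance $d_{12}$ between the center vectors of initial user class 3 and initial user class 1, the distance $d_{13}$ between the center vectors of initial user class 4 and initial user class 1, and the distance $d_{14}$ between the center vectors of initial user class 5 and initial user class 1.

If $d_{11} < d_{12} < d_{13} < d_{14}$, then $d_{11}$ is the smallest distance value and corresponds to initial user class 2, so the electronic device determines initial user class 2 to be the second initial user class. The electronic device merges initial user class 1 with initial user class 2 to obtain merged initial user class 1.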
电子设备将合并初始用户类1作为进行聚类处理后的用户类01,将未合并的初始用户类3作为进行聚类处理后的用户类03,将初始用户类4作为进行聚类处理后的用户类04,将初始用户类5作为进行聚类处理后的用户类05。这样,电子设备得到4个用户类。
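The merge of an undersized initial class into its nearest neighbor can be sketched as below, using the class sizes from the example (8, 12, 11, 15, 17, with threshold 10). The vectors themselves are made up so that initial class 2 has the center closest to initial class 1. The function performs a single merge, as in the example; how to proceed when several classes are undersized is not specified in the text and is not handled here.

```python
import math

def centre_vector(vectors):
    """Component-wise mean of the vectors in one class."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def merge_undersized(classes, min_size):
    """Merge the first class smaller than min_size into its nearest class."""
    small = next((n for n, vs in classes.items() if len(vs) < min_size), None)
    if small is None or len(classes) < 2:
        return dict(classes)  # nothing to merge
    centres = {n: centre_vector(vs) for n, vs in classes.items()}
    nearest = min(
        (n for n in classes if n != small),
        key=lambda n: math.dist(centres[small], centres[n]),
    )
    merged = {n: vs for n, vs in classes.items() if n not in (small, nearest)}
    merged[f"{small}+{nearest}"] = classes[small] + classes[nearest]
    return merged

# Hypothetical vectors; only the class sizes follow the example.
initial = {
    "1": [[0.0, 0.0]] * 8,
    "2": [[1.0, 0.0]] * 12,
    "3": [[5.0, 0.0]] * 11,
    "4": [[9.0, 0.0]] * 15,
    "5": [[20.0, 0.0]] * 17,
}
result = merge_undersized(initial, min_size=10)
```

The result contains four classes, one of which holds the 20 vectors of the former initial classes 1 and 2.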
一种实现方式中,为了获得较好地聚类效果,电子设备在得到多个用户类后,可通过计算用户特征向量的聚合值的方式,对得到的多个用户类进行合并处理。其中,聚合值用于表征用户特征向量归属于用户类的合理程度。
在一个示例中,电子设备可以采用以下步骤获得聚合值。
电子设备计算第一用户特征向量分别与每个第二用户特征向量之间的第一距离值。其中,第二用户特征向量为:第一用户特征向量所在用户类包括的除第一用户特征向量之外的用户特征向量。电子设备对多个第一距离值进行取均值处理,得到第一距离均值。
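The electronic device computes the second distance value between the first user feature vector and each third user feature vector. The third user feature vectors are the user feature vectors in every user class other than the one containing the first user feature vector. The electronic device averages the second distance values that belong to the same user class, obtaining multiple second distance means, and takes the smallest of these second distance means as the minimum distance mean.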
之后,电子设备计算第一距离均值和距离均值最小值的比值,将第一距离均值和距离均值最小值的比值,作为第一用户特征向量的聚合值。
以上仅以第一用户特征向量为例进行说明,并不起限定作用。
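For example, consider the distribution of user classes shown in Figure 4, in which each black dot represents one user feature vector. Figure 4 contains user class 11, user class 12 and user class 13. Taking user feature vector $L_{11}$ of user class 11 as an example, to compute its aggregation value the electronic device computes the first distance value $d_{21}$ between $L_{11}$ and user feature vector $L_{12}$ of user class 11, the first distance value $d_{22}$ between $L_{11}$ and user feature vector $L_{13}$ of user class 11, and the first distance value $d_{23}$ between $L_{11}$ and user feature vector $L_{14}$ of user class 11. The electronic device averages $d_{21}$, $d_{22}$ and $d_{23}$ to obtain the first distance mean $D_1$.

The electronic device computes the second distance value $d_{24}$ between $L_{11}$ and user feature vector $L_{21}$ of user class 12, the second distance value $d_{25}$ between $L_{11}$ and user feature vector $L_{22}$ of user class 12, and the second distance value $d_{26}$ between $L_{11}$ and user feature vector $L_{23}$ of user class 12. The electronic device averages $d_{24}$, $d_{25}$ and $d_{26}$ to obtain the second distance mean $D_2$.

The electronic device computes the second distance value $d_{27}$ between $L_{11}$ and user feature vector $L_{31}$ of user class 13, the second distance value $d_{28}$ between $L_{11}$ and user feature vector $L_{32}$ of user class 13, and the second distance value $d_{29}$ between $L_{11}$ and user feature vector $L_{33}$ of user class 13. The electronic device averages $d_{27}$, $d_{28}$ and $d_{29}$ to obtain the second distance mean $D_3$.

If $D_2 < D_3$, the electronic device computes the ratio of $D_1$ to $D_2$, that is $D_1/D_2$, and takes $D_1/D_2$ as the aggregation value $J_{11}$ of user feature vector $L_{11}$.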
同理,电子设备可以计算出用户类11包括的其他用户特征向量的聚合值,以及用户类12和用户类13包括的用户特征向量的聚合值,在此不再复述。
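The aggregation value described above (first distance mean divided by the smallest second distance mean) can be sketched as follows. The Euclidean distance and the toy coordinates are assumptions; the function also assumes the vector's own class has at least two members and that at least one other class exists, cases the text does not address.

```python
import math
from statistics import mean

def aggregation_value(index, own_class, other_classes):
    """Ratio D1 / min(D2, D3, ...) for the vector own_class[index]."""
    vector = own_class[index]
    peers = [v for i, v in enumerate(own_class) if i != index]
    first_mean = mean(math.dist(vector, v) for v in peers)  # D1
    second_means = [
        mean(math.dist(vector, v) for v in other)           # D2, D3, ...
        for other in other_classes
    ]
    return first_mean / min(second_means)

# Hypothetical coordinates standing in for the dots of Figure 4.
class_11 = [[0, 0], [0, 2], [2, 0]]
class_12 = [[6, 8], [6, 8]]
class_13 = [[30, 40]]
j_11 = aggregation_value(0, class_11, [class_12, class_13])
```

Here $D_1 = 2$, $D_2 = 10$ and $D_3 = 50$, so the aggregation value is $2/10 = 0.2$. Under this definition a smaller value means the vector sits closer to its own class than to any other class.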
可选地,基于上述确定的聚合值,电子设备对得到的多个用户类进行合并处理的过程可包括如下步骤。
电子设备计算多个用户类中任意两个用户类的中心向量之间的距离值,得到多个距离值。
从得到的多个距离值中,电子设备获取最小的距离值,并确定最小的距离值所表征的第一用户类和第二用户类。
电子设备获取多个用户类中每个用户类包括的用户特征向量的第一聚合值。这里,电子设备可以得到多个第一聚合值。
另外,当电子设备将第一用户类和第二用户类作为合并用户类时,也就是,电子设备将第一用户类和第二用户类作为一个用户类时,电子设备获取合并用户类包括的用户特征向量的第二聚合值,并获取多个用户类中除合并用户类外的每个用户类包括的用户特征向量的第二聚合值。这里,电子设备可以得到多个第二聚合值。
电子设备对多个第一聚合值进行累加处理,得到第一和值。电子设备对多个第二聚合值进行累加处理,得到第二和值。这里,多个用户类包括的所有用户特征向量的聚合值的和值,用于评价聚类效果的好坏。
当第二和值小于第一和值时,电子设备确定将第一用户类和第二用户类合并后的聚类效果更好,对第一用户类和第二用户类进行合并处理。
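Afterwards, the electronic device recomputes the distance values between the center vectors of any two of the user classes, determines the two user classes represented by the smallest of these distance values, and merges them, repeating until the second sum is no longer smaller than the first sum.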
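Still taking Figure 4 as an example, the electronic device computes the distance $z_1$ between the center vectors of user class 11 and user class 12, the distance $z_2$ between the center vectors of user class 11 and user class 13, and the distance $z_3$ between the center vectors of user class 12 and user class 13. If $z_1 < z_2 < z_3$, then $z_1$ is the smallest, so user class 11, represented by $z_1$, is taken as the first user class and user class 12, also represented by $z_1$, is taken as the second user class.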
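For user class 11, the electronic device computes the aggregation value $J_{11}$ of user feature vector $L_{11}$, the aggregation value $J_{12}$ of $L_{12}$, the aggregation value $J_{13}$ of $L_{13}$ and the aggregation value $J_{14}$ of $L_{14}$. For user class 12, it computes the aggregation value $J_{21}$ of $L_{21}$, the aggregation value $J_{22}$ of $L_{22}$ and the aggregation value $J_{23}$ of $L_{23}$. For user class 13, it computes the aggregation value $J_{31}$ of $L_{31}$, the aggregation value $J_{32}$ of $L_{32}$ and the aggregation value $J_{33}$ of $L_{33}$.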
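In addition, the electronic device treats user class 11 and user class 12 as merged user class 01. For merged user class 01, it computes the aggregation value $J_{01}$ of user feature vector $L_{11}$, $J_{02}$ of $L_{12}$, $J_{03}$ of $L_{13}$, $J_{04}$ of $L_{14}$, $J_{05}$ of $L_{21}$, $J_{06}$ of $L_{22}$ and $J_{07}$ of $L_{23}$. For user class 13, it computes the aggregation value $J_{08}$ of $L_{31}$, $J_{09}$ of $L_{32}$ and $J_{10}$ of $L_{33}$.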
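The electronic device computes the first sum $M_1$ as: $M_1 = J_{11} + J_{12} + J_{13} + J_{14} + J_{21} + J_{22} + J_{23} + J_{31} + J_{32} + J_{33}$.

The electronic device computes the second sum $M_2$ as: $M_2 = J_{01} + J_{02} + J_{03} + J_{04} + J_{05} + J_{06} + J_{07} + J_{08} + J_{09} + J_{10}$.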
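If $M_2 < M_1$, the electronic device merges user class 11 and user class 12 to obtain merged user class 01. Otherwise, the electronic device does not merge user class 11 and user class 12.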
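The merge test based on the two sums can be sketched as below for the ratio-based aggregation value, where a smaller sum is better. Several choices are assumptions made to keep the sketch runnable: Euclidean distance, classes represented as lists of vectors, every class holding at least two vectors, and refusing to merge when fewer than three classes exist (after a merge there would be no "other" class left to compare against).

```python
import math
from statistics import mean

def centre(vectors):
    """Component-wise mean of the vectors in one class."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def aggregation_value(index, own, others):
    """First distance mean divided by the smallest second distance mean."""
    vector = own[index]
    d1 = mean(math.dist(vector, v) for i, v in enumerate(own) if i != index)
    return d1 / min(mean(math.dist(vector, v) for v in o) for o in others)

def total_aggregation(classes):
    """Sum of the aggregation values of all vectors in all classes."""
    total = 0.0
    for i, own in enumerate(classes):
        others = classes[:i] + classes[i + 1:]
        total += sum(aggregation_value(j, own, others) for j in range(len(own)))
    return total

def try_merge_closest(classes):
    """Merge the two classes with the nearest centres only if the sum drops."""
    if len(classes) < 3:
        return classes
    centres = [centre(c) for c in classes]
    pairs = [(i, j) for i in range(len(classes)) for j in range(i + 1, len(classes))]
    i, j = min(pairs, key=lambda p: math.dist(centres[p[0]], centres[p[1]]))
    merged = [c for k, c in enumerate(classes) if k not in (i, j)]
    merged.append(classes[i] + classes[j])
    # M2 < M1 means the merged partition is judged better.
    return merged if total_aggregation(merged) < total_aggregation(classes) else classes

# Two nearby classes and one distant class (hypothetical coordinates).
classes = [[[0, 0], [0, 1]], [[1, 0], [1, 1]], [[50, 50], [50, 51]]]
result = try_merge_closest(classes)
```

In this toy input the two nearby classes are merged, because their vectors are almost as close to each other's class as to their own, which makes the unmerged sum much larger.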
在一个示例中,为了获得较好地聚类效果,电子设备还可以采用以下步骤获得聚合值。
如上述电子设备确定第一距离均值,以及确定多个第二距离均值中的距离均值最小值的过程。电子设备在计算得到距离均值最小值和第一距离均值的比值之后,将该比值减去1,得到结果为第一用户特征向量的聚合值。
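Still taking user feature vector $L_{11}$ of user class 11 in Figure 4 as an example, the electronic device computes $D_1$, $D_2$ and $D_3$, where $D_2 < D_3$. The electronic device computes the ratio of $D_2$ to $D_1$, that is $D_2/D_1$, and then takes $(D_2/D_1 - 1)$ as the aggregation value $J_{11}$ of user feature vector $L_{11}$.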
在一个示例中,为了获得较好地聚类效果,电子设备还可以采用以下步骤获得聚合值。
如上述电子设备确定第一距离均值,以及确定多个第二距离均值中的距离均值最小值的过程。电子设备在计算得到第一距离均值和距离均值最小值的比值之后,将1减去该比值,得到结果为第一用户特征向量的聚合值。
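Still taking user feature vector $L_{11}$ of user class 11 in Figure 4 as an example, the electronic device computes $D_1$, $D_2$ and $D_3$, where $D_2 < D_3$. The electronic device computes the ratio of $D_1$ to $D_2$, that is $D_1/D_2$, and then takes $(1 - D_1/D_2)$ as the aggregation value $J_{11}$ of user feature vector $L_{11}$.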
可选地,基于1减去比值所得到的聚合值,或比值减去1所得到的聚合值,电子设备对得到的多个用户类进行合并处理的过程可以包括如下步骤。
电子设备计算多个用户类中任意两个用户类的中心向量之间的距离值,得到多个距离值。
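From the multiple distance values obtained, the electronic device takes the smallest one and determines the first user class and the second user class that it represents.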
电子设备获取多个用户类中每个用户类包括的用户特征向量的第一聚合值。这里,电子设备可以得到多个第一聚合值。
另外,当电子设备将第一用户类和第二用户类作为合并用户类时,也就是,电子设备将第一用户类和第二用户类作为一个用户类时,电子设备获取合并用户类包括的用户特征向量的第二聚合值,并获取多个用户类中除合并用户类外的每个用户类包括的用户特征向量的第二聚合值。这里,电子设备可以得到多个第二聚合值。
电子设备对多个第一聚合值进行累加处理,得到第一和值。电子设备对多个第二聚合值进行累加处理,得到第二和值。
当第二和值大于第一和值时,电子设备确定将第一用户类和第二用户类合并后的聚类效果更好,对第一用户类和第二用户类进行合并处理。
之后,电子设备重新计算多个用户类中任意两个用户类的中心向量之间的距离值,确定得到的多个距离值中最小距离值所表征的两个用户类,对这两个用户类进行合并处理,直至第二和值不大于第一和值为止。
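The three aggregation-value definitions given in this part of the description, together with the direction of the merge test that accompanies each, can be summarized as follows. $D_{\min}$ denotes the smallest second distance mean, and $M_1$, $M_2$ are the sums before and after the tentative merge. The labels $J^{(a)}$, $J^{(b)}$, $J^{(c)}$ and the symbol $D_{\min}$ are shorthand introduced for this summary and do not appear in the original text.

```latex
\begin{aligned}
J^{(a)} &= \frac{D_1}{D_{\min}},     &&\text{merge if } M_2 < M_1,\\
J^{(b)} &= \frac{D_{\min}}{D_1} - 1, &&\text{merge if } M_2 > M_1,\\
J^{(c)} &= 1 - \frac{D_1}{D_{\min}}, &&\text{merge if } M_2 > M_1.
\end{aligned}
```

As a side observation that the source does not make: when $D_1 < D_{\min}$, $J^{(c)}$ coincides with the silhouette coefficient $(b - a)/\max(a, b)$ with $a = D_1$ and $b = D_{\min}$.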
可选地,在一种实现方式中,为了提高聚类处理的速度,电子设备可以根据多个用户中每个用户的用户属性,先对多个用户进行粗分类,得到每个用户所属的粗分类。对于每个粗分类,以一个粗分类为例。电子设备通过预设的聚类算法,对这一个粗分类包括的多个用户特征向量进行聚类处理,得到多个用户类。
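For example, the user attributes include a job attribute. Job attributes include accountant, cashier, human resources, customer service, research and development design, and so on. Users are coarsely classified by job attribute. For instance, accountants, cashiers and other users belonging to the finance department are placed in one coarse class, human resources staff and other users belonging to the personnel department are placed in one coarse class, customer service staff and other users belonging to the administration department are placed in one coarse class, research and development designers and other users belonging to the design department are placed in one coarse class, and so on.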
在进行聚类处理时,电子设备通过预设的聚类算法,分别对设计部、财务部、行政部、人事部这四个粗分类中每个粗分类包括的多个用户的用户特征向量进行聚类处理,得到多个用户类。
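The coarse classification step can be sketched as a simple grouping by job attribute, after which the clustering algorithm is run separately on each group. The role keys, department names and the `ROLE_TO_GROUP` mapping below are hypothetical labels for the examples in the text.

```python
from collections import defaultdict

# Hypothetical mapping from job attribute to coarse class.
ROLE_TO_GROUP = {
    "accountant": "finance",
    "cashier": "finance",
    "human_resources": "personnel",
    "customer_service": "administration",
    "rd_design": "design",
}

def coarse_groups(users):
    """Group (name, role, feature_vector) records by coarse class."""
    groups = defaultdict(list)
    for name, role, vector in users:
        groups[ROLE_TO_GROUP[role]].append((name, vector))
    return dict(groups)

users = [
    ("A", "accountant", [10, 8, 2, 500]),
    ("B", "cashier", [9, 8, 3, 490]),
    ("C", "rd_design", [10, 7, 1, 600]),
]
groups = coarse_groups(users)
# Each value of `groups` would then be clustered on its own.
```

Clustering each group separately reduces the number of vectors handled per run, which is the speed-up the text refers to.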
可选地,为了提高电子设备获取差异特征向量的准确性,电子设备预先存储了每个用户类的距离值范围。距离值范围用于对用户类包括的用户特征向量与用户类的中心向量之间的距离值进行限制。
在一种实现方式中,电子设备可以采用以下步骤确定距离值范围。
电子设备计算用户类X的中心向量,分别与用户类X包括的每个用户特征向量的距离值,得到多个距离值。用户类X为任一用户类。
电子设备计算多个距离值的距离均值,作为第三距离均值。电子设备还计算多个距离值的标准差,作为第一标准差。电子设备根据第三距离均值和第一标准差,构建正态分布曲线。该正态分布曲线:用于表征用户类X的中心向量与用户类X包括的用户特征向量之间的距离值分布。
基于正态分布曲线,电子设备根据第三距离均值和第一标准差,确定第一边界值和第二边界值。其中,第一边界值小于第三距离均值,第一边界值与第三距离均值的差的绝对值为:预设倍数的第一标准差。第二边界值大于第三距离均值,第二边界值与第三距离均值的差的绝对值同样为:预设倍数的第一标准差。
电子设备将第一边界值和第二边界值组成的区间,确定为用户类X的距离值范围。
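In one example, the preset multiple is 3. The electronic device then determines the distance range of user class X from three standard deviations, as shown in Figure 5. In Figure 5, $\mu_1$ is the third distance mean, $s$ is the first standard deviation, and the distance range is $\mu_1 - 3s \sim \mu_1 + 3s$.

Under a normal distribution, data lying more than three standard deviations from the third distance mean $\mu_1$ is a small-probability event (about 0.3% of samples), one that is regarded as practically not occurring. For a user feature vector in user class X, if its distance to the center vector of user class X is not within the distance range, the electronic device may treat that user feature vector as a difference feature vector.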
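Deriving the distance range from the mean and standard deviation can be sketched as follows. The text does not say whether the population or the sample standard deviation is meant; the sketch assumes the population form (`statistics.pstdev`), and the sample distances are made up.

```python
from statistics import mean, pstdev

def sigma_range(distances, multiple=3.0):
    """Return (mu - multiple * s, mu + multiple * s) for the given distances."""
    mu = mean(distances)      # third distance mean
    s = pstdev(distances)     # first standard deviation (population form assumed)
    return mu - multiple * s, mu + multiple * s

# Hypothetical distances between the class centre and its member vectors.
distances = [2, 4, 4, 4, 5, 5, 7, 9]
low, high = sigma_range(distances)
```

For these values the mean is 5 and the standard deviation is 2, giving the range $-1 \sim 11$. Because distances are never negative, a negative lower bound never flags anything, so in such a case only the upper bound is effective.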
但是,在实际应用中,用户类的中心向量与用户类包括的用户特征向量的距离值分布不是均符合正态分布曲线。在另一种实现方式中,电子设备可以采用以下方式确定距离值范围。
电子设备计算用户类X的中心向量分别与用户类X包括的每个用户特征向量的距离值,得到多个距离值。
电子设备根据预设的对数函数,计算多个距离值中每个距离值的对数值。电子设备还计算多个对数值的均值,作为对数均值。电子设备还计算多个对数值的标准差,作为第二标准差。电子设备根据对数均值和第二标准差,构建正态分布曲线。该正态分布曲线:用于表征用户类X的中心向量与用户类X包括的用户特征向量之间的距离值的对数分布。
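Based on the normal distribution curve, the electronic device determines a third boundary value and a fourth boundary value from the logarithmic mean and the second standard deviation. The third boundary value is smaller than the logarithmic mean, and the absolute value of the difference between the third boundary value and the logarithmic mean is the preset multiple of the second standard deviation. The fourth boundary value is larger than the logarithmic mean, and the absolute value of the difference between the fourth boundary value and the logarithmic mean is likewise the preset multiple of the second standard deviation.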
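Using the inverse of the preset logarithmic function, the electronic device computes the antilogarithm of the third boundary value as the first antilogarithm value, and the antilogarithm of the fourth boundary value as the second antilogarithm value. For example, if the preset logarithmic function is $y = \log_{10} x$, its inverse is $x = 10^{y}$.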
电子设备将第一反对数值和第二反对数值组成的区间,确定为用户类X的距离值范围。
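The logarithm-based variant can be sketched in the same way: take base-10 logarithms of the distances, compute the boundaries in log space, then map them back with $10^{y}$. The population standard deviation and the sample values are assumptions. A distance of exactly zero has no logarithm, a case the text does not cover and this sketch does not handle.

```python
import math
from statistics import mean, pstdev

def log_sigma_range(distances, multiple=3.0):
    """Distance range from the mean and deviation of log10(distance)."""
    logs = [math.log10(d) for d in distances]  # requires every d > 0
    mu = mean(logs)                            # logarithmic mean
    s = pstdev(logs)                           # second standard deviation
    third_boundary = mu - multiple * s
    fourth_boundary = mu + multiple * s
    # Apply the inverse function x = 10 ** y to return to distance units.
    return 10 ** third_boundary, 10 ** fourth_boundary

# Hypothetical, strongly skewed distances.
low, high = log_sigma_range([1, 10, 100], multiple=1.0)
```

Unlike the direct mean-and-deviation range, the lower bound obtained this way is always positive, and the range is symmetric around the geometric mean rather than the arithmetic mean.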
可选地,在一种实现方式中,为了提高电子设备确定异常用户的准确性,根据多个用户行为维度中的每个用户行为维度,电子设备判断差异特征向量对应的用户特征值是否超过预设的特征基线值。一个例子中,每个用户行为维度,电子设备预设有一个特征基线值。
如果差异特征向量对应的用户特征值超过特征基线值,则电子设备可确定在用户行为维度下所表征的用户行为为异常用户行为,并确定差异特征向量所表征的用户为异常用户。
如果差异特征向量对应的用户特征值未超过特征基线值,则电子设备确定在用户行为维度下所表征的用户行为为正常用户行为。若差异特征向量对应的所有用户特征值均未超过特征基线值,则电子设备确定差异特征向量所表征的用户为正常用户。
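For example, the feature baseline value of user behavior dimension 1 is $X_1$, that of user behavior dimension 2 is $X_2$, and that of user behavior dimension 3 is $X_3$. The difference feature vector contains user feature value 1 for user behavior dimension 1, user feature value 2 for user behavior dimension 2 and user feature value 3 for user behavior dimension 3.

For user behavior dimension 1, if user feature value 1 exceeds the feature baseline value $X_1$, the electronic device may determine that the user behavior represented under user behavior dimension 1 is abnormal, and that the user represented by the difference feature vector is an abnormal user.

For user behavior dimension 2, if user feature value 2 exceeds the feature baseline value $X_2$, the electronic device may determine that the user behavior represented under user behavior dimension 2 is abnormal, and that the user represented by the difference feature vector is an abnormal user.

For user behavior dimension 3, if user feature value 3 exceeds the feature baseline value $X_3$, the electronic device may determine that the user behavior represented under user behavior dimension 3 is abnormal, and that the user represented by the difference feature vector is an abnormal user.

If user feature value 1 does not exceed $X_1$, user feature value 2 does not exceed $X_2$, and user feature value 3 does not exceed $X_3$, the electronic device may determine that the user represented by the difference feature vector is a normal user.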
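The per-dimension baseline check can be sketched as below: the user behind a difference feature vector is confirmed as abnormal if at least one feature value exceeds its baseline, and treated as normal only if none does. The dimension names and baseline numbers are hypothetical.

```python
def exceeded_dimensions(feature_values, baselines):
    """Dimensions in which the difference feature vector exceeds its baseline."""
    return [dim for dim, value in feature_values.items() if value > baselines[dim]]

def confirm_abnormal(feature_values, baselines):
    """True if at least one dimension exceeds its feature baseline value."""
    return len(exceeded_dimensions(feature_values, baselines)) > 0

# Hypothetical baselines X1, X2, X3 for three user behavior dimensions.
baselines = {"mac_switches_per_day": 2, "im_sent_files": 20, "im_sent_file_size_kb": 5000}
suspect = {"mac_switches_per_day": 6, "im_sent_files": 3, "im_sent_file_size_kb": 800}
quiet = {"mac_switches_per_day": 1, "im_sent_files": 3, "im_sent_file_size_kb": 800}
```

Returning the list of exceeded dimensions, rather than only a flag, also tells an administrator which behavior was judged abnormal.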
本公开实施例中,对于用户特征值差异性较小的用户行为维度,例如,用户行为维度1,电子设备可以直接确定用户行为维度1的特征基线值。
例如,用户切换MAC地址频率一般为1天为1次或2次,此时,电子设备可以确定MAC地址切换频率这个用户行为维度的特征基线值为2。
对于用户特征值差异性较大的用户行为维度,例如,用户行为维度2,电子设备统计多个用户行为数据在用户行为维度2下的用户特征值的概率密度分布。电子设备根据概率密度分布,确定用户行为维度2的特征基线值。
例如,如图6所示的累计概率曲线图。图6中,横轴为用户特征值,纵轴为累计概率。坐标轴内的矩形为用户特征值的概率密度。累计概率曲线为基于概率密度分布获得的。从图6中可以看出,用户特征值在20-120区间时,累计概率曲线的斜率远小于平均斜率。此时,电子设备可以确定图6所表征的用户行为维度的特征基线值为:小于20或大于120。
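In one implementation, the multiple pieces of user behavior data obtained by the electronic device may instead belong to a single user, and an embodiment of the present disclosure provides a further abnormal user identification method for this case. Refer to Figure 7, which is yet another flowchart of the abnormal user identification method provided by an embodiment of the present disclosure. The method includes the following steps.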
步骤701:电子设备获取待识别用户的多个用户行为数据。多个用户行为数据包括至少一个历史用户行为数据和一个当前用户行为数据。
本公开实施例中,当需要检测待识别用户是否为异常用户时,电子设备获取待识别用户的多个用户行为数据。此处仅以待识别用户为例进行说明,并不起限定作用。
电子设备可以从用户行为日志中,获取到待识别用户的多个用户行为数据。这里,用户行为日志用于记录用户的各种网络行为。另外,电子设备也可以从用户输入的用户行为数据中,获取到待识别用户的多个用户行为数据。本公开实施例不限定电子设备获取用户行为数据的形式。
在本公开实施例中,电子设备可以根据预先设置的时间粒度,获取不同用户的用户行为数据。其中,电子设备可根据对异常用户识别的多种需求,设置不同的时间粒度。
电子设备按照预设的时间粒度,获取待识别用户的多个用户行为数据。
在一个示例中,假设当前时间为10:00,待识别用户为用户A1。电子设备预设的时间粒度为10分钟。电子设备可获取在9:50-10:00所表示的时间段内用户A1的用户行为数据31,在9:40-9:50所表示的时间段内用户A1的用户行为数据32,以及在9:30-9:40所表示的时间段内用户A1的用户行为数据33等。其中,用户行为数据31为用户A1的当前用户行为数据。用户行为数据32和用户行为数据33等为用户A1的历史用户行为数据。
在另一个示例中,假设当前时间为10:00,待识别用户为用户A1。电子设备预设的时间粒度为10分钟。电子设备可获取在10:00-10:10所表示的时间段内用户A1的用户行为数据41,在9:50-10:00所表示的时间段内用户A1的用户行为数据42,在9:40-9:50所表示的时间段内用户A1的用户行为数据43,以及在9:30-9:40所表示的时间段内用户A1的用户行为数据44等。其中,用户行为数据41为用户A1的当前用户行为数据。用户行为数据42、用户行为数据43和用户行为数据44等为用户A1的历史用户行为数据。
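Step 702: the electronic device extracts multiple first data feature values of each piece of historical user behavior data under the preset user behavior dimensions, and extracts multiple second data feature values of the current user behavior data under the same user behavior dimensions.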
具体地,为了便于电子设备提取每个用户行为数据在多个用户行为维度下的数据特征值,可对用户行为维度进行划分,得到业务层特征维度和行为层特征维度。通过业务层特征维度和行为层特征维度,可使电子设备快速地在多个用户行为维度下提取数据特征值。
电子设备通过将业务层特征维度和行为层特征维度包括的内容进行任意组合,得到多个用户行为维度。如图3所示,在一个示例中,电子设备得到的用户行为维度包括但不限于:IM发送信息数、IM接收信息数、IM发送文件数、IM发送文件大小等。
进而,在多个用户行为维度下,电子设备提取到多个用户行为数据中每个历史用户行为数据的多个第一数据特征值,以及提取到多个用户行为数据中当前用户行为数据的多个第二数据特征值。
步骤703:电子设备根据多个第一数据特征值,确定每个历史用户行为数据的第一数据特征向量,并根据多个第二数据特征值,确定当前用户行为数据的第二数据特征向量。
对于多个用户行为数据中的每个历史用户行为数据,以一个历史用户行为数据为例说明。电子设备对这一个历史用户行为数据的多个第一数据特征值进行组合处理,得到这一个历史用户行为数据的第一数据特征向量。
对于多个用户行为数据中的当前用户行为数据,电子设备对当前用户行为数据的多个第二数据特征值进行组合处理,得到当前用户行为数据的第二数据特征向量。
具体地,按照前述步骤的示例,电子设备获取到用户A1的用户行为数据31、用户A1的用户行为数据32和用户A1的用户行为数据33。
电子设备从用户行为数据31中提取到IM发送信息数为10、IM接收信息数为8、IM发送文件数为2、IM发送文件大小为500KB。
电子设备从用户行为数据32中提取到IM发送信息数为9、IM接收信息数为8、IM发送文件数为3、IM发送文件大小为490KB。
电子设备从用户行为数据33中提取到IM发送信息数为10、IM接收信息数为7、IM发送文件数为1、IM发送文件大小为600KB。
此时,电子设备可以确定:用户行为数据31的数据特征向量01为{10,8,2,500},用户行为数据32的数据特征向量02为{9,8,3,490},用户行为数据33的数据特征向量03为{10,7,1,600}。其中,数据特征向量01为第二数据特征向量,数据特征向量02和数据特征向量03为第一数据特征向量。
步骤704:电子设备通过预设的聚类算法,对多个第一数据特征向量和第二数据特征向量进行聚类处理,得到多个数据类。
本公开实施例中,预设的聚类算法可以为K-means聚类算法、K-means Plus聚类算法等。电子设备通过预设的聚类算法,对多个第一数据特征向量和第二数据特征向量进行聚类处理,得到多个数据类。每个数据类包括至少一个数据特征向量。
在一个示例中,预设的聚类算法为K-means聚类算法。电子设备通过K-means聚类算法,对多个第一数据特征向量和第二数据特征向量进行聚类处理,得到K个初始数据类。其中,K为正整数。电子设备将这K个初始数据类作为K个数据类。
步骤705:电子设备确定第二数据特征向量所属的第一数据类的第一中心向量。
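In this embodiment, the electronic device determines, among the multiple data classes, the first data class to which the second data feature vector belongs, computes the mean of the data feature vectors in the first data class, and takes that mean as the center vector of the first data class, so as to determine whether the user to be identified is currently an abnormal user. The center vector of the first data class is the first center vector.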
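Continuing the example from the preceding steps, the first data class contains data feature vector 01, data feature vector 02 and data feature vector 03. The electronic device computes the mean $t_2$ of data feature vectors 01, 02 and 03, and takes $t_2$ as the first center vector of the first data class.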
步骤706:电子设备确定第二数据特征向量与第一中心向量之间的距离值。
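Continuing the example from the preceding steps, data feature vector 01 is the second data feature vector and the center vector of the first data class is $t_2$. The electronic device computes the distance value $d_{a1}$ between data feature vector 01 and the center vector $t_2$.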
步骤707:若距离值未在预设距离范围内,电子设备确定待识别用户为异常用户。
电子设备确定第二数据特征向量与第一中心向量之间的距离值,判断确定的距离值是否在预设距离值范围内。若未在预设距离范围内,则电子设备可确定第二数据特征向量为差异特征向量,确定第二数据特征向量所表征的用户为异常用户,即确定待识别用户为异常用户。
电子设备预设有距离值范围。不同的数据类中,数据特征向量的分布不同。为了提高电子设备识别异常用户的准确性,电子设备可预先设置第一数据类的距离值范围。
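In this embodiment, for the distance value between the second data feature vector and the center vector of the first data class, if the electronic device determines that this distance value is not within the preset distance range, it determines that the user to be identified is an abnormal user. If the electronic device determines that this distance value is within the preset distance range, it determines that the user to be identified is a normal user.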
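Continuing the example from the preceding steps, the preset distance range is $d_{a01} \sim d_{a02}$. The electronic device computes the distance value $d_{a1}$ between data feature vector 01 and the center vector $t_2$. If $d_{a1} < d_{a01}$ or $d_{a1} > d_{a02}$, the electronic device may determine that the user to be identified is an abnormal user, that is, that user A1 is an abnormal user.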
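Steps 705 to 707 for a single user can be sketched as one function that compares the current data feature vector with the center of its own data class. The vectors are those of the example for user A1. The Euclidean distance and both distance ranges are assumptions chosen only to show the two possible outcomes.

```python
import math

def is_current_abnormal(current, history, distance_range):
    """Check the current vector against the centre of its data class.

    `history` holds the historical vectors that fell into the same data class
    as `current`; the centre is the mean over all of them, current included.
    """
    members = [current] + history
    centre = [sum(column) / len(members) for column in zip(*members)]
    distance = math.dist(current, centre)
    low, high = distance_range
    return distance < low or distance > high

current = [10, 8, 2, 500]                     # data feature vector 01
history = [[9, 8, 3, 490], [10, 7, 1, 600]]   # data feature vectors 02 and 03

# Both ranges are hypothetical stand-ins for d_a01 ~ d_a02.
normal_case = is_current_abnormal(current, history, (0.0, 35.0))
abnormal_case = is_current_abnormal(current, history, (35.0, 80.0))
```

The distance here is about 30.0, so the user is treated as normal under the first range and as abnormal under the second.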
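In this embodiment, the electronic device clusters the data feature vectors to obtain the first data class to which the current user behavior data belongs. The electronic device identifies abnormal users from the distance between the second data feature vector in the first data class and the center vector of the first data class. Administrators no longer need to add restricted user names to a blacklist, and the electronic device no longer needs to identify abnormal users by building a blacklist. The abnormal user identification method provided by this embodiment thus identifies users who are unknown to administrators and whose abnormal behavior administrators cannot detect.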
可选地,在一种实现方式中,为了避免一个数据类中包括的数据特征向量的个数存在过少的情况,导致聚类效果不理想,异常用户识别不准确。电子设备存储了预先设置的数量阈值,数量阈值用于对数据类包括的数据特征向量的个数进行限制。电子设备通过预设的聚类算法,对多个第一数据特征向量和第二数据特征向量进行聚类处理,得到多个数据类(步骤704),可包括如下步骤。
电子设备通过K-means聚类算法,对多个第一数据特征向量和第二数据特征向量进行聚类处理,得到K个初始数据类。
电子设备获取K个初始数据类中的第一初始数据类。其中,第一初始数据类包括N个数据特征向量,N为正整数。第一初始数据类为:K个初始数据类中第二数据特征向量所属的初始数据类。
电子设备检测N是否小于数量阈值。如果N不小于数量阈值,则电子设备将这K个初始数据类作为K个数据类。
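If N is smaller than the preset quantity threshold, the electronic device obtains the second initial data class among the K initial data classes. The second initial data class is the initial data class, among the K initial data classes, whose center vector has the smallest distance to the center vector of the first initial data class.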
之后,电子设备对第一初始数据类与第二初始数据类进行合并处理,得到合并初始数据类。
电子设备将合并初始数据类作为聚类处理后的数据类,并将K个初始数据类中未合并的其他初始数据类作为聚类处理后的数据类。进而,电子设备得到多个数据类。
一种实现方式中,为了获得较好地聚类效果,电子设备可通过计算数据特征向量的聚合值的方式,对得到的多个数据类再次进行合并处理。其中,聚合值用于表征数据特征向量归属于数据类的合理程度。
In one example, the electronic device may obtain the aggregation value through the following steps.
电子设备计算第三数据特征向量分别与每个第四数据特征向量之间的第一距离值。其中,第四数据特征向量为:第三数据特征向量所在数据类包括的除第三数据特征向量之外的数据特征向量。电子设备对多个第一距离值进行取均值处理,得到第一距离均值。
电子设备计算第三数据特征向量与每个第五数据特征向量之间的第二距离值。其中,第五数据特征向量为:除第三数据特征向量所在数据类之外的每个数据类包括的数据特征向量。电子设备对多个属于同一数据类的第二距离值进行取均值处理,得到多个第二距离均值。电子设备获取多个第二距离均值中的距离均值最小值。
之后,电子设备计算第一距离均值和距离均值最小值的比值,得到第三数据特征向量的聚合值。
上述仅以第三数据特征向量为例进行说明,并不起限定作用。
在一个示例中,基于上述确定的聚合值,电子设备对得到的多个数据类进行合并处理的过程可以包括如下步骤。
电子设备计算第一中心向量与多个数据类中除第一数据类外的任意一个数据类的第二中心向量之间的距离值,得到多个距离值。即电子设备计算第一中心向量与第二中心向量之间的距离值,得到多个距离值。第二中心向量为:多个数据类中除第一数据类外的任意一个数据类的中心向量。
从得到的多个距离值中,电子设备获取最小的距离值,并确定最小的距离值所表征的第二数据类。
电子设备获取多个数据类中每个数据类包括的数据特征向量的第三聚合值。这里,电子设备可以得到多个第三聚合值。
另外,当电子设备将第一数据类和第二数据类作为合并数据类时,也就是,电子设备将第一数据类和第二数据类作为一个数据类时,获取合并数据类包括的数据特征向量的第四聚合值,并获取多个数据类中除合并数据类外的每个数据类包括的数据特征向量的第四聚合值。这里,电子设备可以得到多个第四聚合值。
电子设备对多个第三聚合值进行累加处理,得到第三和值。电子设备对多个第四聚合值进行累加处理,得到第四和值。这里,多个数据类中所有数据特征向量的聚合值的和值,用于评价聚类效果的好坏。
当第四和值小于第三和值时,电子设备确定将第一数据类和第二数据类合并后的聚类效果更好,对第一数据类和第二数据类进行合并处理。
之后,电子设备重新计算第一数据类的中心向量与多个数据类中除第一数据类外的任意一个数据类的中心向量之间的距离值,确定得到的多个距离值中最小距离值所表征的第二数据类,对第一数据类和第二数据类进行合并处理,直至第四和值不小于第三和值为止。
在一个示例中,为了获得较好地聚类效果,电子设备还可以采用以下步骤获得聚合值。
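The electronic device determines the first distance mean and the minimum among the multiple second distance means as described above. After computing the ratio of the minimum distance mean to the first distance mean, the electronic device subtracts 1 from that ratio and takes the result as the aggregation value of the third data feature vector.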
在一个示例中,为了获得较好地聚类效果,电子设备还可以采用以下步骤获得聚合值。
如上述电子设备确定第一距离均值,以及确定多个第二距离均值中的距离均值最小值的过程。电子设备在计算得到第一距离均值和距离均值最小值的比值之后,将1减去该比值,得到结果为第三数据特征向量的聚合值。
可选地,基于1减去比值得到的聚合值,或基于比值减去1所得到的聚合值,电子设备对得到的多个数据类进行合并处理的过程可以包括如下步骤。
电子设备计算第一中心向量与多个数据类中除第一数据类外的任意一个数据类的第二中心向量之间的距离值,得到多个距离值。即电子设备计算第一中心向量与第二中心向量之间的距离值,得到多个距离值。第二中心向量为:多个数据类中除第一数据类外的任意一个数据类的中心向量。
从得到的多个距离值中,电子设备获取最小的距离值,确定最小的距离值所表征的第二数据类。
电子设备获取多个数据类中,每个数据类包括的数据特征向量的第三聚合值。这里,电子设备可以得到多个第三聚合值。
另外,当电子设备将第一数据类和第二数据类作为合并数据类时,也就是,电子设备将第一数据类和第二数据类作为一个数据类时,获取合并数据类包括的数据特征向量的第四聚合值,并获取多个数据类中除合并数据类外的每个数据类包括的数据特征向量的第四聚合值。这里,电子设备可以得到多个第四聚合值。
电子设备对多个第三聚合值进行累加处理,得到第三和值。电子设备对多个第四聚合值进行累加处理,得到第四和值。
当第四和值大于第三和值时,电子设备确定将第一数据类和第二数据类合并后的聚类效果更好,对第一数据类和第二数据类进行合并处理。
之后,电子设备重新计算第一数据类的中心向量与多个数据类中除第一数据类外的任意一个数据类的中心向量之间的距离值,确定得到的多个距离值中最小距离值所表征的第二数据类,对第一数据类和第二数据类进行合并处理,直至第四和值不大于第三和值为止。
可选地,为了提高电子设备对异常用户识别的准确性,电子设备预先存储了第一数据类的距离值范围。距离值范围用于对数据类中的数据特征向量与数据类的中心向量之间的距离值进行限制。
在一种实现方式中,电子设备可以采用以下方式确定第一数据类的距离值范围。
电子设备计算第一中心向量分别与第一数据类包括的每个数据特征向量的距离值,得到多个距离值。
电子设备计算多个距离值的距离均值,作为第三距离均值。电子设备还计算多个距离值的标准差,作为第一标准差。电子设备根据第三距离均值和第一标准差,可以构建正态分布曲线。该正态分布曲线:用于表征第一中心向量与第一数据类包括的数据特征向量之间的距离值分布。
基于正态分布曲线,电子设备根据第三距离均值和第一标准差,确定第一边界值和第二边界值。其中,第一边界值小于第三距离均值,第一边界值与第三距离均值的差的绝对值为:预设倍数的第一标准差。第二边界值大于第三距离均值,第二边界值与第三距离均值的差的绝对值同样为:预设倍数的第一标准差。
电子设备将第一边界值和第二边界值组成的区间,确定为第一数据类的距离值范围。
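In one example, the preset multiple is 3. The electronic device then determines the distance range of the first data class from three standard deviations, as shown in Figure 5. In Figure 5, $\mu_1$ is the third distance mean, $s$ is the first standard deviation, and the distance range is $\mu_1 - 3s \sim \mu_1 + 3s$.

Under a normal distribution, data lying more than three standard deviations from the third distance mean $\mu_1$ is a small-probability event (about 0.3% of samples), one that is regarded as practically not occurring. If the distance value between the second data feature vector and the first center vector is not within the distance range, the electronic device may treat the user to be identified as an abnormal user.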
但是,在实际应用中,第一中心向量与第一数据类包括的数据特征向量的距离值分布不一定符合正态分布。另一种实现方式中,电子设备可以采用以下方式确定第一数据类的距离值范围。
电子设备计算第一中心向量分别与第一数据类包括的每个数据特征向量的距离值,得到多个距离值。
电子设备根据预设的对数函数,计算多个距离值中每个距离值的对数值。电子设备还计算多个对数值的均值,作为对数均值。电子设备还计算得到的多个对数值的标准差,作为第二标准差。电子设备根据对数均值和第二标准差,可以构建正态分布曲线。该正态分布曲线:用于表征第一中心向量与第一数据类中的数据特征向量之间的距离值的对数分布。
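Based on the normal distribution curve, the electronic device determines a third boundary value and a fourth boundary value from the logarithmic mean and the second standard deviation. The third boundary value is smaller than the logarithmic mean, and the absolute value of the difference between the third boundary value and the logarithmic mean is the preset multiple of the second standard deviation. The fourth boundary value is larger than the logarithmic mean, and the absolute value of the difference between the fourth boundary value and the logarithmic mean is likewise the preset multiple of the second standard deviation.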
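Using the inverse of the preset logarithmic function, the electronic device computes the antilogarithm of the third boundary value as the first antilogarithm value, and the antilogarithm of the fourth boundary value as the second antilogarithm value. For example, if the preset logarithmic function is $y = \log_{10} x$, its inverse is $x = 10^{y}$.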
电子设备将第一反对数值和第二反对数值组成的区间,确定为第一数据类的距离值范围。
可选地,在一种实现方式中,为了提高电子设备识别异常用户的准确性,根据多个用户行为维度中的每个用户行为维度,电子设备判断第二数据特征向量对应的数据特征值是否超过预设的特征基线值。一个例子中,每个用户行为维度,电子设备预设有一个特征基线值。此处,第二数据特征向量即为差异特征向量。
如果第二数据特征向量对应的数据特征值超过特征基线值,则电子设备可确定在用户行为维度下所表征的用户行为为异常用户行为,并确定待识别用户为异常用户。
如果第二数据特征向量对应的数据特征值未超过特征基线值,则电子设备可确定在用户行为维度下所表征的用户行为为正常用户行为。若第二数据特征向量对应的所有数据特征值均未超过特征基线值,则电子设备确定待识别用户为正常用户。
本公开实施例中,对于数据特征值差异性较小的用户行为维度,例如,用户行为维度1,电子设备可以直接确定用户行为维度1的特征基线值。例如,用户切换MAC地址频率一般为1天为1次或2次,此时,电子设备可以确定MAC地址切换频率这个用户行为维度的特征基线值为2。
对于数据特征值差异性较大的用户行为维度,例如,用户行为维度2,电子设备统计多个用户行为数据在用户行为维度2下的数据特征值的概率密度分布。电子设备根据概率密度分布,确定用户行为维度2的特征基线值。
基于相同的发明构思,根据上述异常用户识别方法,本公开实施例还提供了一种异常用户识别装置。参考图8,图8为本公开实施例提供的异常用户识别装置的第一种结构示意图,该装置包括:获取单元801、提取单元802、第一确定单元803、聚类单元804、第二确定单元805和第三确定单元806。
获取单元801,用于获取用户的用户行为数据;
提取单元802,用于提取用户行为数据在预设的多个行为维度下的多个特征值;
第一确定单元803,用于根据多个特征值,确定用户行为数据对应的特征向量;
聚类单元804,用于通过预设的聚类算法,对特征向量进行聚类处理,得到多个聚合类,并获取每个聚合类的中心向量;
第二确定单元805,用于确定差异特征向量,差异特征向量与所属聚合类的中心向量之间的距离值未在预设距离值范围内;
第三确定单元806,用于将差异特征向量所表征的用户确定为异常用户。
本公开实施例提供的异常用户识别装置中,电子设备通过对特征向量进行聚类处理,获取与聚合类的中心向量的距离未在预设距离值范围内的差异特征向量。电子设备根据获取的差异特征向量实现对异常用户的识别。管理人员无需再将限制的用户名加入黑名单中,电子设备也无需再通过建立黑名单的方式对异常用户进行识别,实现了对管理人员未知且无法发现异常行为的用户进行识别。
在一个示例中,上述用户为多个用户。
此时,获取单元801,具体可以用于获取多个用户的用户行为数据;
提取单元802,具体可以用于提取多个用户中每个用户的用户行为数据在预设的多个用户行为维度下的多个用户特征值;
第一确定单元803,具体可以用于根据多个用户中每个用户的多个用户特征值,确定多个用户中每个用户的用户特征向量;
聚类单元804,具体可以用于通过预设的聚类算法,对多个用户的用户特征向量进行聚类处理,得到多个用户类,根据多个用户类中每个用户类包括的用户特征向量,确定多个用户类中每个用户类的中心向量。
在一个示例中,聚类单元804,具体可以用于:
通过K-means聚类算法,对多个用户的用户特征向量进行聚类处理,得到K个初始用户类;K为正整数;
获取K个初始用户类中第一初始用户类和第二初始用户类;
对第一初始用户类与第二初始用户类进行合并处理,得到合并初始用户类;
将合并初始用户类和K个初始用户类中未合并的其他初始用户类,分别作为进行聚类处理后的用户类,得到多个用户类;
第一初始用户类为:K个初始用户类中,包括的用户特征向量的个数小于预设数量阈值的初始用户类;
第二初始用户类为:K个初始用户类中,与第一初始用户类的中心向量的距离值最小的中心向量所表征的初始用户类。
在一个示例中,聚类单元804,还可以用于:
计算多个用户类中任意两个用户类的中心向量之间的距离值,得到多个距离值;
确定多个距离值中最小距离值所表征的第一用户类和第二用户类;
获取多个用户类中每个用户类包括的用户特征向量的第一聚合值;
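when the first user class and the second user class are treated as a merged user class, obtain the second aggregation values of the user feature vectors in the merged user class, and obtain the second aggregation values of the user feature vectors in every user class other than the merged user class;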
对多个第一聚合值进行累加处理,得到第一和值;
对多个第二聚合值进行累加处理,得到第二和值;
当第二和值小于第一和值时,对第一用户类和第二用户类进行合并处理;
其中,聚合值用于表征用户特征向量归属于用户类中的合理程度。
在一个示例中,聚类单元804,还可以用于:
计算第一用户特征向量与每个第二用户特征向量之间的第一距离值;第二用户特征向量为:第一用户特征向量所在用户类中,除第一用户特征向量之外的用户特征向量;
计算第一用户特征向量与每个第三用户特征向量之间的第二距离值;其中,第三用户特征向量为:除第一用户特征向量所在用户类之外的每个用户类中的用户特征向量;
对多个第一距离值进行取均值处理,得到第一距离均值;
对多个属于同一用户类的第二距离值进行取均值处理,得到多个第二距离均值;
获取多个第二距离均值中的距离均值最小值;
将第一距离均值与距离均值最小值的比值,作为第一用户特征向量的聚合值。
在一个示例中,第三确定单元806,具体可以用于:
根据多个用户行为维度中的每个用户行为维度,判断差异特征向量对应的用户特征值是否超过预设的特征基线值;
如果差异特征向量对应的用户特征值超过特征基线值,则确定在用户行为维度下所表征的用户行为为异常用户行为,并确定差异特征向量所表征的用户为异常用户。
在一个示例中,上述用户为一个用户,用户行为数据可以包括:所述用户的至少一个历史用户行为数据和一个当前用户行为数据。
此时,获取单元801,具体可以用于获取待识别用户的多个用户行为数据,多个用户行为数据包括:至少一个历史用户行为数据和一个当前用户行为数据;
提取单元802,具体可以用于提取至少一个历史用户行为数据中,每个历史用户行为数据在预设的多个行为维度下的多个第一数据特征值,并提取当前用户行为数据在多个行为维度下的多个第二数据特征值;
第一确定单元803,具体可以用于根据多个第一数据特征值,确定至少一个历史用户行为数据中,每个历史用户行为数据的第一数据特征向量,并根据多个第二数据特征值,确定当前用户行为数据的第二数据特征向量;
聚类单元804,具体可以用于通过预设的聚类算法,对多个第一数据特征向量和第二数据特征向量进行聚类处理,得到多个数据类;确定第二数据特征向量所属的第一数据类的中心向量;
第二确定单元805,用于判断第二数据特征向量与第一数据类的中心向量之间的距离值是否在预设距离值范围内;若否,则确定第二数据特征向量为差异特征向量。
在一个示例中,聚类单元804,具体可以用于:
通过K-means聚类算法,对多个第一数据特征向量和第二数据特征向量进行聚类处理,得到K个初始数据类;K为正整数;
获取K个初始数据类中的第一初始数据类,第一初始数据类包括N个数据特征向量,N为正整数;
若N小于预设数量阈值,则获取K个初始数据类中的第二初始数据类;
对第一初始数据类与第二初始数据类进行合并处理,得到合并初始数据类;
将合并初始数据类和K个初始数据类中未合并的其他初始数据类,分别作为进行聚类处理后的数据类,得到多个数据类;
第一初始数据类为:第二数据特征向量所属的初始数据类;
第二初始数据类为:所述K个初始数据类中,与第一初始数据类的中心向量的距离值最小的中心向量所表征的初始数据类。
在一个示例中,聚类单元804,还可以用于:
计算第一中心向量与多个数据类中除第一数据类外的任意一个数据类的第二中心向量之间的距离值,得到多个距离值;
确定多个距离值中最小距离值所表征的第二数据类;
获取多个数据类中每个数据类包括的数据特征向量的第三聚合值;
当将第一数据类和第二数据类作为合并数据类时,获取合并数据类包括的数据特征向量的第四聚合值,并获取多个数据类中除合并数据类外的每个数据类包括的数据特征向量的第四聚合值;
对多个第三聚合值进行累加处理,得到第三和值;
对多个第四聚合值进行累加处理,得到第四和值;
当第四和值小于第三和值时,对第一数据类和第二数据类进行合并处理;
其中,聚合值用于表征数据特征向量归属于数据类中的合理程度。
在一个示例中,聚类单元804,还可以用于:
计算第三数据特征向量与每个第四数据特征向量之间的第一距离值;第四数据特征向量为:第三数据特征向量所在数据类中,除第三数据特征向量之外的数据特征向量;
计算第三数据特征向量与每个第五数据特征向量之间的第二距离值;第五数据特征向量为:除第三数据特征向量所在数据类之外的每个数据类中的数据特征向量;
对多个第一距离值进行取均值处理,得到第一距离均值;
average the second distance values that belong to the same data class, to obtain multiple second distance means;
获取多个第二距离均值中的距离均值最小值;
将第一距离均值与距离均值最小值的比值,作为第三数据特征向量的聚合值。
在一个示例中,第三确定单元806,具体可以用于:
根据多个用户行为维度中的每个用户行为维度,判断第二数据特征向量对应的数据特征值是否超过预设的特征基线值;其中,第二数据特征向量为差异特征向量;
如果第二数据特征向量对应的数据特征值超过特征基线值,则确定在用户行为维度下所表征的用户行为为异常用户行为,并确定待识别用户为异常用户。
基于相同的发明构思,根据上述异常用户识别方法,本公开实施例还提供了一种电子设备,如图9所示,包括处理器901和机器可读存储介质902,机器可读存储介质902存储有能够被处理器901执行的机器可执行指令。机器可执行指令促使处理器901:
获取用户的用户行为数据;
提取用户行为数据在预设的多个行为维度下的多个特征值;
根据多个特征值,确定用户行为数据对应的特征向量;
通过预设的聚类算法,对特征向量进行聚类处理,得到多个聚合类,并获取每个聚合类的中心向量;
确定差异特征向量,差异特征向量与所属聚合类的中心向量之间的距离值未在预设距离值范围内;
将差异特征向量所表征的用户确定为异常用户。
本公开实施例提供的电子设备中,电子设备通过对特征向量进行聚类处理,获取与聚合类的中心向量的距离未在预设距离值范围内的差异特征向量。电子设备根据获取的差异特征向量实现对异常用户的识别。管理人员无需再将限制的用户名加入黑名单中,电子设备也无需再通过建立黑名单的方式对异常用户进行识别,实现了对管理人员未知且无法发现异常行为的用户进行识别。
在一个示例中,当用户为多个用户时,机器可执行指令具体可以促使处理器901:
提取多个用户中每个用户的用户行为数据在多个用户行为维度下的多个用户特征值;
根据多个用户中每个用户的多个用户特征值,确定多个用户中每个用户的用户特征向量;
通过预设的聚类算法,对多个用户的用户特征向量进行聚类处理,得到多个用户类;
根据多个用户类中每个用户类包括的用户特征向量,确定多个用户类中每个用户类的中心向量。
在一个示例中,机器可执行指令具体可以促使处理器901:
通过K-means聚类算法,对多个用户的用户特征向量进行聚类处理,得到K个初始用户类;K为正整数;
获取K个初始用户类中第一初始用户类和第二初始用户类;
对第一初始用户类与第二初始用户类进行合并处理,得到合并初始用户类;
将合并初始用户类和K个初始用户类中未合并的其他初始用户类,分别作为进行聚类处理后的用户类,得到多个用户类;
第一初始用户类为K个初始用户类中包括的用户特征向量的个数小于预设数量阈值的初始用户类;
第二初始用户类为K个初始用户类中与第一初始用户类的中心向量的距离值最小的中心向量所表征的初始用户类。
在一个示例中,机器可执行指令还可以促使处理器901:
计算多个用户类中任意两个用户类的中心向量之间的距离值,得到多个距离值;
确定多个距离值中最小距离值所表征的第一用户类和第二用户类;
获取多个用户类中每个用户类包括的用户特征向量的第一聚合值;
当将第一用户类和第二用户类作为合并用户类时,获取合并用户类包括的用户特征向量的第二聚合值,并获取多个用户类中,除合并用户类外的每个用户类包括的用户特征向量的第二聚合值;
对多个第一聚合值进行累加处理,得到第一和值;
对多个第二聚合值进行累加处理,得到第二和值;
当第二和值小于第一和值时,对第一用户类和第二用户类进行合并处理;
其中,聚合值用于表征用户特征向量归属于用户类中的合理程度。
在一个示例中,机器可执行指令还可以促使处理器901:
计算第一用户特征向量与每个第二用户特征向量之间的第一距离值;第二用户特征向量为:第一用户特征向量所在用户类中,除第一用户特征向量之外的用户特征向量;
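compute the second distance value between the first user feature vector and each third user feature vector, where the third user feature vectors are the user feature vectors in every user class other than the one containing the first user feature vector;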
对多个第一距离值进行取均值处理,得到第一距离均值;
对多个属于同一用户类的第二距离值进行取均值处理,得到多个第二距离均值;
获取多个第二距离均值中的距离均值最小值;
将第一距离均值与距离均值最小值的比值,作为第一用户特征向量的聚合值。
在一个示例中,机器可执行指令具体可以促使处理器901:
根据多个用户行为维度中的每个用户行为维度,判断差异特征向量对应的用户特征值是否超过预设的特征基线值;
如果差异特征向量对应的用户特征值超过特征基线值,则确定在用户行为维度下所表征的用户行为为异常用户行为,并确定差异特征向量所表征的用户为异常用户。
在一个示例中,当所述用户为一个用户时,所述用户行为数据包括:所述用户的至少一个历史用户行为数据和一个当前用户行为数据;
机器可执行指令具体可以促使处理器901:
提取所述至少一个历史用户行为数据中每个历史用户行为数据在多个行为维度下的多个第一数据特征值,并提取所述当前用户行为数据在多个行为维度下的多个第二数据特征值;
根据所述多个第一数据特征值,确定每个历史用户行为数据的第一数据特征向量,并根据所述多个第二数据特征值,确定所述当前用户行为数据的第二数据特征向量;
通过预设的聚类算法,对所述多个第一数据特征向量和所述第二数据特征向量进行聚类处理,得到多个数据类;
确定所述第二数据特征向量所属的第一数据类的中心向量;
判断所述第二数据特征向量与所述第一数据类的中心向量之间的距离值是否在预设距离值范围内;
若否,则确定所述第二数据特征向量为差异特征向量。
在一个示例中,机器可执行指令具体可以促使处理器901:
通过K-means聚类算法,对多个第一数据特征向量和第二数据特征向量进行聚类处理,得到K个初始数据类;K为正整数;
获取K个初始数据类中的第一初始数据类,第一初始数据类包括N个数据特征向量,N为正整数;
若N小于预设数量阈值,则获取K个初始数据类中的第二初始数据类;
对第一初始数据类与第二初始数据类进行合并处理,得到合并初始数据类;
将合并初始数据类和K个初始数据类中未合并的其他初始数据类,分别作为进行聚类处理后的数据类,得到多个数据类;
第一初始数据类为:第二数据特征向量所属的初始数据类;
第二初始数据类为:K个初始数据类中,与第一初始数据类的中心向量的距离值最小的中心向量所表征的初始数据类。
在一个示例中,机器可执行指令还可以促使处理器901:
计算第一中心向量与多个数据类中除第一数据类外的任意一个数据类的第二中心向量之间的距离值,得到多个距离值;
确定多个距离值中最小距离值所表征的第二数据类;
获取多个数据类中每个数据类包括的数据特征向量的第三聚合值;
当将第一数据类和第二数据类作为合并数据类时,获取合并数据类包括的数据特征向量的第四聚合值,并获取多个数据类中除合并数据类外的每个数据类包括的数据特征向量的第四聚合值;
对多个第三聚合值进行累加处理,得到第三和值;
对多个第四聚合值进行累加处理,得到第四和值;
当第四和值小于第三和值时,对第一数据类和第二数据类进行合并处理;
其中,聚合值用于表征数据特征向量归属于数据类中的合理程度。
在一个示例中,机器可执行指令还可以促使处理器901:
计算第三数据特征向量与每个第四数据特征向量之间的第一距离值;第四数据特征向量为:第三数据特征向量所在数据类中,除第三数据特征向量之外的数据特征向量;
计算第三数据特征向量与每个第五数据特征向量之间的第二距离值;第五数据特征向量为:除第三数据特征向量所在数据类之外的每个数据类中的数据特征向量;
对多个第一距离值进行取均值处理,得到第一距离均值;
average the second distance values that belong to the same data class, to obtain multiple second distance means;
获取多个第二距离均值中的距离均值最小值;
将第一距离均值与距离均值最小值的比值,作为第三数据特征向量的聚合值。
在一个示例中,机器可执行指令具体可以促使处理器901:
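for each of the multiple user behavior dimensions, judge whether the data feature value of the second data feature vector exceeds the preset feature baseline value, where the second data feature vector is the difference feature vector;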
如果第二数据特征向量对应的数据特征值超过特征基线值,则确定在用户行为维度下所表征的用户行为为异常用户行为,并确定待识别用户为异常用户。
另外,如图9所示,电子设备还可以包括:通信接口903和通信总线904;其中,处理器901、机器可读存储介质902、通信接口903通过通信总线904完成相互间的通信,通信接口903用于上述电子设备与其他设备之间的通信。
上述通信总线可以是外设部件互连标准(英文:Peripheral Component Interconnect,简称:PCI)总线或扩展工业标准结构(英文:Extended Industry Standard Architecture,简称:EISA)总线等。该通信总线可以分为地址总线、数据总线、控制总线等。
上述机器可读存储介质可以包括随机存取存储器(英文:Random Access Memory,简称:RAM),也可以包括非易失性存储器(英文:Non-Volatile Memory,简称:NVM),例如至少一个磁盘存储器。另外,机器可读存储介质还可以是至少一个位于远离前述处理器的存储装置。
The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP) and the like. It may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
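Based on the same inventive concept, and corresponding to the abnormal user identification method above, an embodiment of the present disclosure further provides a machine-readable storage medium storing machine-executable instructions which, when invoked and executed by a processor, cause the processor to carry out the steps of any of the abnormal user identification methods shown in Figures 1 to 7.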
基于相同的发明构思,根据上述异常用户识别方法,本公开实施例还提供了一种机器可执行指令,在被处理器调用和执行时,机器可执行指令促使处理器实现上述图1-7所示的任一异常用户识别方法步骤。
需要说明的是,在本文中,诸如第一和第二等之类的关系术语仅仅用来将一个实体或者操作与另一个实体或操作区分开来,而不一定要求或者暗示这些实体或操作之间存在任何这种实际的关系或者顺序。而且,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、物品或者设备不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、物品或者设备所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括所述要素的过程、方法、物品或者设备中还存在另外的相同要素。
本说明书中的各个实施例均采用相关的方式描述,各个实施例之间相同相似的部分互相参见即可,每个实施例重点说明的都是与其他实施例的不同之处。尤其,对于异常用户识别装置、电子设备、机器可读存储介质实施例而言,由于其基本相似于异常用户识别方法实施例,所以描述的比较简单,相关之处参见异常用户识别方法实施例的部分说明即可。
以上所述仅为本公开的较佳实施例而已,并非用于限定本公开的保护范围。凡在本公开的精神和原则之内所作的任何修改、等同替换、改进等,均包含在本公开的保护范围内。
Claims (15)
- 一种异常用户识别方法,所述方法包括:获取用户的用户行为数据;提取所述用户行为数据在预设的多个行为维度下的多个特征值;根据所述多个特征值,确定所述用户行为数据对应的特征向量;通过预设的聚类算法,对所述特征向量进行聚类处理,得到多个聚合类,并获取每个聚合类的中心向量;确定差异特征向量,所述差异特征向量与所属聚合类的中心向量之间的距离值未在预设距离值范围内;将所述差异特征向量所表征的用户确定为异常用户。
- 根据权利要求1所述的方法,当所述用户为多个用户时,所述提取所述用户行为数据在预设的多个行为维度下的多个特征值,包括:提取每个用户的用户行为数据在多个用户行为维度下的多个用户特征值;所述根据所述多个特征值,确定所述用户行为数据对应的特征向量,包括:根据所述多个用户中每个用户的多个用户特征值,确定所述多个用户中每个用户的用户特征向量;所述通过预设的聚类算法,对所述特征向量进行聚类处理,得到多个聚合类,并获取每个聚合类的中心向量,包括:通过预设的聚类算法,对所述多个用户的用户特征向量进行聚类处理,得到多个用户类;根据所述多个用户类中每个用户类包括的用户特征向量,确定所述多个用户类中每个用户类的中心向量。
- 根据权利要求2所述的方法,所述通过预设的聚类算法,对所述多个用户的用户特征向量进行聚类处理,得到多个用户类,包括:通过K-means聚类算法,对所述多个用户的用户特征向量进行聚类处理,得到K个初始用户类;所述K为正整数;获取所述K个初始用户类中第一初始用户类和第二初始用户类;对所述第一初始用户类与所述第二初始用户类进行合并处理,得到合并初始用户类;将所述合并初始用户类和所述K个初始用户类中未合并的其他初始用户类,分别作为进行聚类处理后的用户类,得到多个用户类;所述第一初始用户类为所述K个初始用户类中包括的用户特征向量的个数小于预设数量阈值的初始用户类;所述第二初始用户类为所述K个初始用户类中的初始用户类,该初始用户类的中心向量与所述第一初始用户类的中心向量的距离值最小。
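- The method according to claim 3, further comprising: computing the distance values between the center vectors of any two user classes among the multiple user classes, to obtain multiple distance values; determining the first user class and the second user class represented by the smallest of the multiple distance values; obtaining a first aggregation value of each user feature vector included in each of the multiple user classes; when the first user class and the second user class are treated as a merged user class, obtaining a second aggregation value of each user feature vector included in the merged user class, and obtaining a second aggregation value of each user feature vector included in each user class, among the multiple user classes, other than the merged user class; summing the multiple first aggregation values to obtain a first sum; summing the multiple second aggregation values to obtain a second sum; and when the second sum is smaller than the first sum, merging the first user class and the second user class; wherein the aggregation value represents how reasonable it is for a user feature vector to belong to its user class.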
- 根据权利要求1所述的方法,当所述用户为一个用户时,所述用户行为数据包括:所述用户的至少一个历史用户行为数据和一个当前用户行为数据;所述提取所述用户行为数据在预设的多个用户行为维度下的多个特征值,包括:提取每个历史用户行为数据在多个用户行为维度下的多个第一数据特征值,并提取所述当前用户行为数据在多个用户行为维度下的多个第二数据特征值;所述根据所述多个特征值,确定所述用户行为数据对应的特征向量,包括:根据所述多个第一数据特征值,确定每个历史用户行为数据的第一数据特征向量,并根据所述多个第二数据特征值,确定所述当前用户行为数据的第二数据特征向量;所述通过预设的聚类算法,对所述特征向量进行聚类处理,得到多个聚合类,并获取每个聚合类的中心向量,包括:通过预设的聚类算法,对所述多个第一数据特征向量和所述第二数据特征向量进行聚类处理,得到多个数据类;确定所述第二数据特征向量所属的第一数据类的中心向量;所述确定差异特征向量,包括:判断所述第二数据特征向量与所述第一数据类的中心向量之间的距离值是否在预设距离值范围内;若距离值不在预设距离值范围内,则确定所述第二数据特征向量为差异特征向量。
- 根据权利要求5所述的方法,所述通过预设的聚类算法,对所述多个第一数据特征向量和所述第二数据特征向量进行聚类处理,得到多个数据类,包括:通过K-means聚类算法,对所述多个第一数据特征向量和所述第二数据特征向量进行聚类处理,得到K个初始数据类;所述K为正整数;获取所述K个初始数据类中的第一初始数据类,所述第一初始数据类包括N个数据特征向量,所述N为正整数;若N小于预设数量阈值,则获取所述K个初始数据类中的第二初始数据类;对所述第一初始数据类与所述第二初始数据类进行合并处理,得到合并初始数据类;将所述合并初始数据类和所述K个初始数据类中未合并的其他初始数据类,分别作为进行聚类处理后的数据类,得到多个数据类;所述第一初始数据类为所述第二数据特征向量所属的初始数据类;所述第二初始数据类为所述K个初始数据类中与所述第一初始数据类的中心向量的距离值最小的中心向量所表征的初始数据类。
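- The method according to claim 6, further comprising: computing the distance values between the first center vector and second center vectors, to obtain multiple distance values, wherein a second center vector is the center vector of any data class, among the multiple data classes, other than the first data class; determining the second data class represented by the smallest of the multiple distance values; obtaining a third aggregation value of each data feature vector included in each of the multiple data classes; when the first data class and the second data class are treated as a merged data class, obtaining a fourth aggregation value of each data feature vector included in the merged data class, and obtaining a fourth aggregation value of each data feature vector included in each data class, among the multiple data classes, other than the merged data class; summing the multiple third aggregation values to obtain a third sum; summing the multiple fourth aggregation values to obtain a fourth sum; and when the fourth sum is smaller than the third sum, merging the first data class and the second data class; wherein the aggregation value represents how reasonable it is for a data feature vector to belong to its data class.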
- 根据权利要求2或5所述的方法,所述确定所述差异特征向量所表征的用户为异常用户,包括:根据所述多个用户行为维度中的每个用户行为维度,判断所述差异特征向量对应的数据特征值是否超过预设的特征基线值;如果所述差异特征向量对应的数据特征值超过所述特征基线值,则确定在所述用户行为维度下所表征的用户行为为异常用户行为,并确定所述差异特征向量所表征的用户为异常用户。
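- An electronic device, comprising a processor and a machine-readable storage medium, the machine-readable storage medium storing machine-executable instructions executable by the processor, the machine-executable instructions causing the processor to: obtain user behavior data of a user; extract multiple feature values of the user behavior data under multiple preset behavior dimensions; determine, from the multiple feature values, the feature vector corresponding to the user behavior data; cluster the feature vectors with a preset clustering algorithm to obtain multiple clusters, and obtain the center vector of each cluster; determine a difference feature vector, the distance value between the difference feature vector and the center vector of the cluster to which it belongs not being within a preset distance range; and determine the user represented by the difference feature vector to be an abnormal user.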
- 根据权利要求9所述的电子设备,当所述用户为多个用户时,所述机器可执行指令具体促使所述处理器:提取所述多个用户中每个用户的用户行为数据在多个用户行为维度下的多个用户特征值;根据所述多个用户中每个用户的多个用户特征值,确定所述多个用户中每个用户的用户特征向量;通过预设的聚类算法,对所述多个用户的用户特征向量进行聚类处理,得到多个用户类;根据所述多个用户类中每个用户类包括的用户特征向量,确定所述多个用户类中每个用户类的中心向量。
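- The electronic device according to claim 10, wherein the machine-executable instructions specifically cause the processor to: cluster the user feature vectors of the multiple users with the K-means clustering algorithm to obtain K initial user classes, K being a positive integer; obtain a first initial user class and a second initial user class among the K initial user classes; merge the first initial user class with the second initial user class to obtain a merged initial user class; and take the merged initial user class and the other, unmerged initial user classes among the K initial user classes as the user classes resulting from clustering, to obtain multiple user classes; wherein the first initial user class is an initial user class, among the K initial user classes, whose number of user feature vectors is smaller than a preset quantity threshold; and the second initial user class is the initial user class, among the K initial user classes, whose center vector has the smallest distance to the center vector of the first initial user class.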
- 根据权利要求9所述的电子设备,当所述用户为一个用户时,所述用户行为数据包括:所述用户的至少一个历史用户行为数据和一个当前用户行为数据;所述机器可执行指令具体促使所述处理器:提取每个历史用户行为数据在多个行为维度下的多个第一数据特征值,并提取所述当前用户行为数据在多个行为维度下的多个第二数据特征值;根据所述多个第一数据特征值,确定每个历史用户行为数据的第一数据特征向量,并根据所述多个第二数据特征值,确定所述当前用户行为数据的第二数据特征向量;通过预设的聚类算法,对所述多个第一数据特征向量和所述第二数据特征向量进行聚类处理,得到多个数据类;确定所述第二数据特征向量所属的第一数据类的中心向量;判断所述第二数据特征向量与所述第一数据类的中心向量之间的距离值是否在预设距离值范围内;若距离值不在预设距离值范围内,则确定所述第二数据特征向量为差异特征向量。
- 根据权利要求12所述的电子设备,所述机器可执行指令具体促使所述处理器:通过K-means聚类算法,对所述多个第一数据特征向量和所述第二数据特征向量进行聚类处理,得到K个初始数据类;所述K为正整数;获取所述K个初始数据类中的第一初始数据类,所述第一初始数据类包括N个数据特征向量;所述N为正整数;若N小于预设数量阈值,则获取所述K个初始数据类中的第二初始数据类;对所述第一初始数据类与所述第二初始数据类进行合并处理,得到合并初始数据类;将所述合并初始数据类和所述K个初始数据类中未合并的其他初始数据类,分别作为进行聚类处理后的数据类,得到多个数据类;所述第一初始数据类为所述第二数据特征向量所属的初始数据类;所述第二初始数据类为所述K个初始数据类中与所述第一初始数据类的中心向量的距离值最小的中心向量所表征的初始数据类。
- 根据权利要求10或12所述的电子设备,所述机器可执行指令具体促使所述处理器:根据所述多个用户行为维度中的每一用户行为维度,判断所述差异特征向量对应的数据特征值是否超过预设的特征基线值;如果所述差异特征向量对应的数据特征值超过所述特征基线值,则确定在所述用户行为维度下所表征的用户行为为异常用户行为,并确定所述差异特征向量所表征的用户为异常用户。
- 一种机器可读存储介质,存储有机器可执行指令,在被处理器调用和执行时,所述机器可执行指令促使所述处理器实现权利要求1-8任一项所述的方法。
Priority Applications (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| EP19803060.3A EP3771168B1 (en) | 2018-05-14 | 2019-05-09 | Abnormal user identification method |
| US17/049,563 US11671434B2 (en) | 2018-05-14 | 2019-05-09 | Abnormal user identification |
| JP2020563918A JP7125514B2 (ja) | 2018-05-14 | 2019-05-09 | 異常ユーザーの識別方法、電子機器及び機械可読記憶媒体 |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201810457994.8 | 2018-05-14 | ||
| CN201810457994.8A CN109861953B (zh) | 2018-05-14 | 2018-05-14 | 一种异常用户识别方法及装置 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2019218927A1 true WO2019218927A1 (zh) | 2019-11-21 |
Family
ID=66889595
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/CN2019/086232 Ceased WO2019218927A1 (zh) | 2018-05-14 | 2019-05-09 | Abnormal user identification |
Country Status (5)
| Country | Link |
|---|---|
| US (1) | US11671434B2 (zh) |
| EP (1) | EP3771168B1 (zh) |
| JP (1) | JP7125514B2 (zh) |
| CN (1) | CN109861953B (zh) |
| WO (1) | WO2019218927A1 (zh) |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
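| CN111259962A (zh) * | 2020-01-17 | 2020-06-09 | 中南大学 | Sybil account detection method for time-series social data |
| CN118132383A (zh) * | 2024-03-22 | 2024-06-04 | 北京衡石科技有限公司 | Business data monitoring method and apparatus, electronic device, and computer-readable medium |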
Families Citing this family (18)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
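| CN112131320B (zh) * | 2019-06-25 | 2024-08-27 | 杭州海康威视数字技术股份有限公司 | Abnormal data detection method, apparatus, and storage medium |
| CN110493176B (zh) * | 2019-07-02 | 2022-06-10 | 北京科东电力控制系统有限责任公司 | Method and system for analyzing suspicious user behavior based on unsupervised machine learning |
| CN110753065B (zh) * | 2019-10-28 | 2022-03-01 | 国网河南省电力公司信息通信公司 | Network behavior detection method, apparatus, device, and storage medium |
| CN110990810B (zh) * | 2019-11-28 | 2022-06-28 | 中国建设银行股份有限公司 | User operation data processing method, apparatus, device, and storage medium |
| CN111259948A (zh) * | 2020-01-13 | 2020-06-09 | 中孚安全技术有限公司 | User security behavior baseline analysis method based on fused machine learning algorithms |
| CN111625817B (zh) * | 2020-05-12 | 2023-05-02 | 咪咕文化科技有限公司 | Abnormal user identification method and apparatus, electronic device, and storage medium |
| CN113837512B (zh) * | 2020-06-23 | 2024-08-13 | 中国移动通信集团辽宁有限公司 | Abnormal user identification method and apparatus |
| CN112488246A (zh) * | 2020-08-06 | 2021-03-12 | 蔡淦祺 | Information processing method and system based on live streaming and online e-commerce product promotion |
| CN112149749B (zh) * | 2020-09-29 | 2024-03-19 | 北京明朝万达科技股份有限公司 | Abnormal behavior detection method and apparatus, electronic device, and readable storage medium |
| CN112437091B (zh) * | 2020-11-30 | 2021-09-21 | 成都信息工程大学 | Abnormal traffic detection method oriented to host community behavior |
| CN112766459B (zh) * | 2021-01-12 | 2024-05-03 | 合肥黎曼信息科技有限公司 | Generator-based anomaly detection method |
| CN113129054B (zh) * | 2021-03-30 | 2024-05-31 | 广州博冠信息科技有限公司 | User identification method and apparatus |
| CN113343056A (zh) * | 2021-05-21 | 2021-09-03 | 北京市燃气集团有限责任公司 | Method and apparatus for detecting abnormal user gas consumption |
| CN114492647B (zh) * | 2022-01-28 | 2024-06-21 | 中国银联股份有限公司 | Federated graph clustering method and apparatus based on distributed graph embedding, and readable storage medium |
| CN114565784B (zh) * | 2022-03-15 | 2024-08-23 | 平安科技(深圳)有限公司 | Clustering-algorithm-based method and apparatus for detecting abnormal pedestrian behavior, and storage medium |
| CN114862109B (zh) * | 2022-03-29 | 2026-02-13 | 广东电网有限责任公司 | Electricity consumption anomaly monitoring method and apparatus, electronic device, and storage medium |
| CN116304763B (zh) * | 2023-05-18 | 2023-10-24 | 国网山东省电力公司日照供电公司 | Power data pre-analysis method, system, device, and medium |
| CN117132242B (zh) * | 2023-10-26 | 2024-01-23 | 北京点聚信息技术有限公司 | Security management method for electronic seal identity permissions |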
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
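| CN104239324A (zh) * | 2013-06-17 | 2014-12-24 | 阿里巴巴集团控股有限公司 | Method and system for user-behavior-based feature extraction and personalized recommendation |
| CN104268481A (zh) * | 2014-10-10 | 2015-01-07 | 中国联合网络通信集团有限公司 | Method and apparatus for implementing smartphone early warning |
| WO2016054988A1 (zh) * | 2014-10-09 | 2016-04-14 | 阿里巴巴集团控股有限公司 | Method and apparatus for identifying a smart device user |
| CN105681089A (zh) * | 2016-01-26 | 2016-06-15 | 上海晶赞科技发展有限公司 | Network user behavior clustering method, apparatus, and terminal |
| CN106649517A (zh) * | 2016-10-17 | 2017-05-10 | 北京京东尚科信息技术有限公司 | Data mining method, apparatus, and system |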
Family Cites Families (21)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US9495868B2 (en) | 2013-11-01 | 2016-11-15 | Here Global B.V. | Traffic data simulator |
| US20150235152A1 (en) * | 2014-02-18 | 2015-08-20 | Palo Alto Research Center Incorporated | System and method for modeling behavior change and consistency to detect malicious insiders |
| US10038703B2 (en) * | 2014-07-18 | 2018-07-31 | The Regents Of The University Of Michigan | Rating network security posture and comparing network maliciousness |
| CN105320702B (zh) * | 2014-08-04 | 2019-02-01 | Tcl集团股份有限公司 | User behavior data analysis method and apparatus, and smart television |
| CN104537380A (zh) * | 2014-12-30 | 2015-04-22 | 小米科技有限责任公司 | Clustering method and apparatus |
| US10061816B2 (en) * | 2015-05-11 | 2018-08-28 | Informatica Llc | Metric recommendations in an event log analytics environment |
| JP5946573B1 (ja) | 2015-08-05 | 2016-07-06 | 株式会社日立パワーソリューションズ | Abnormality sign diagnosis system and abnormality sign diagnosis method |
| US10505959B1 (en) * | 2015-12-10 | 2019-12-10 | Hewlett Packard Enterprise Development Lp | System and method directed to behavioral profiling services |
| US9979740B2 (en) | 2015-12-15 | 2018-05-22 | Flying Cloud Technologies, Inc. | Data surveillance system |
| CN105553998B (zh) * | 2015-12-23 | 2019-02-01 | 中国电子科技集团公司第三十研究所 | Network attack anomaly detection method |
| CN107181724B (zh) * | 2016-03-11 | 2021-02-12 | 华为技术有限公司 | Cooperative flow identification method and system, and server using the method |
| US10257211B2 (en) * | 2016-05-20 | 2019-04-09 | Informatica Llc | Method, apparatus, and computer-readable medium for detecting anomalous user behavior |
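| CN107622072B (zh) | 2016-07-15 | 2021-08-17 | 阿里巴巴集团控股有限公司 | Method for identifying web page operation behavior, server, and terminal |
| CN107645533A (zh) * | 2016-07-22 | 2018-01-30 | 阿里巴巴集团控股有限公司 | Data processing method, data sending method, risk identification method, and device |
| KR102464390B1 (ko) * | 2016-10-24 | 2022-11-04 | 삼성에스디에스 주식회사 | Method and apparatus for anomaly detection based on behavior analysis |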
| US20180211270A1 (en) * | 2017-01-25 | 2018-07-26 | Business Objects Software Ltd. | Machine-trained adaptive content targeting |
| US10645109B1 (en) * | 2017-03-31 | 2020-05-05 | Exabeam, Inc. | System, method, and computer program for detection of anomalous user network activity based on multiple data sources |
| US10341372B2 (en) * | 2017-06-12 | 2019-07-02 | International Business Machines Corporation | Clustering for detection of anomalous behavior and insider threat |
| US10701094B2 (en) * | 2017-06-22 | 2020-06-30 | Oracle International Corporation | Techniques for monitoring privileged users and detecting anomalous activities in a computing environment |
| US20190116193A1 (en) * | 2017-10-17 | 2019-04-18 | Yanlin Wang | Risk assessment for network access control through data analytics |
| EP3477906B1 (en) * | 2017-10-26 | 2021-03-31 | Accenture Global Solutions Limited | Systems and methods for identifying and mitigating outlier network activity |
2018
- 2018-05-14: CN application CN201810457994.8A, patent CN109861953B (zh), status: Active
2019
- 2019-05-09: US application US17/049,563, patent US11671434B2 (en), status: Active
- 2019-05-09: WO application PCT/CN2019/086232, publication WO2019218927A1 (zh), status: Ceased
- 2019-05-09: JP application JP2020563918A, patent JP7125514B2 (ja), status: Active
- 2019-05-09: EP application EP19803060.3A, patent EP3771168B1 (en), status: Active
Non-Patent Citations (1)
| Title |
|---|
| See also references of EP3771168A4 * |
Also Published As
| Publication number | Publication date |
|---|---|
| JP2021524091A (ja) | 2021-09-09 |
| JP7125514B2 (ja) | 2022-08-24 |
| US20210240822A1 (en) | 2021-08-05 |
| CN109861953A (zh) | 2019-06-07 |
| CN109861953B (zh) | 2020-08-21 |
| US11671434B2 (en) | 2023-06-06 |
| EP3771168A4 (en) | 2021-05-26 |
| EP3771168A1 (en) | 2021-01-27 |
| EP3771168B1 (en) | 2022-04-27 |
Similar Documents
| Publication | Title | Publication Date |
|---|---|---|
| WO2019218927A1 (zh) | Abnormal user identification | |
| US11855968B2 (en) | Methods and systems for deep learning based API traffic security | |
| US11924230B2 (en) | Individual device response options from the monitoring of multiple devices | |
| US8205255B2 (en) | Anti-content spoofing (ACS) | |
| US10958657B2 (en) | Utilizing transport layer security (TLS) fingerprints to determine agents and operating systems | |
| US10296739B2 (en) | Event correlation based on confidence factor | |
| US8185930B2 (en) | Adjusting filter or classification control settings | |
| US20240320329A1 (en) | Machine Learning Model Adversarial Attack Monitoring | |
| US10965680B2 (en) | Authority management method and device in distributed environment, and server | |
| CN106302445B (zh) | Method and apparatus for processing a request | |
| US11089024B2 (en) | System and method for restricting access to web resources | |
| US11665188B1 (en) | System and method for scanning remote services to locate stored objects with malware | |
| US11190589B1 (en) | System and method for efficient fingerprinting in cloud multitenant data loss prevention | |
| CN109547427B (zh) | Blacklisted user identification method and apparatus, computer device, and storage medium | |
| CN107592296A (zh) | Method and apparatus for identifying spam accounts | |
| US11361084B1 (en) | Identifying and protecting against a computer security threat while preserving privacy of individual client devices using differential privacy for text documents | |
| CN110837638B (zh) | Ransomware detection method, apparatus, device, and storage medium | |
| CN113489726B (zh) | Traffic limiting method and device | |
| CN113196265A (zh) | Security detection analysis | |
| US11308212B1 (en) | Adjudicating files by classifying directories based on collected telemetry data | |
| CN114095936A (zh) | SMS verification code request method, attack defense method, apparatus, medium, and device | |
| CN115835215A (zh) | Message anti-harassment method, system, device, and storage medium | |
| US20190356678A1 (en) | Network security tool | |
| CN121079686A (zh) | Data retrieval control | |
| CN120915566A (zh) | Client anomaly detection method, apparatus, device, computer medium, and product | |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 19803060; Country of ref document: EP; Kind code of ref document: A1 |
| | ENP | Entry into the national phase | Ref document number: 2019803060; Country of ref document: EP; Effective date: 20201021 |
| | ENP | Entry into the national phase | Ref document number: 2020563918; Country of ref document: JP; Kind code of ref document: A |
| | NENP | Non-entry into the national phase | Ref country code: DE |