Disclosure of Invention
In view of the technical shortcomings of traditional methods based on hand-crafted features, the invention provides a detection method and a detection device based on dynamic behavior characteristic representation, in order to better detect the credibility of information.
According to an aspect of the present invention, a method for detecting unreal information based on dynamic expression learning is provided, which includes the following steps:
acquiring information to be detected;
detecting the information to be detected by using a pre-established detection model;
outputting a detection result;
wherein, the detection model is established as follows:
step S1, modeling a dynamic behavior expression that jointly represents the user information and the user behavior information of a given event; the user information comprises the characteristics of the user and the user credibility, and the behavior information comprises the behavior type;
step S2, since an event is composed of different pieces of information, combining the dynamic behavior expressions of step S1 to obtain an event credibility detection expression;
step S3, using a time characteristic matrix to obtain the dynamic behavior characteristics of users during information propagation;
step S4, generating a user characteristic expression;
step S5, estimating the detection model parameters by a pairwise learning method.
According to a second aspect of the present invention, there is provided an unreal information detecting apparatus based on dynamic expression learning, including:
the acquisition module is configured to acquire information to be detected;
the detection module is configured to detect the information to be detected by utilizing a pre-established detection model;
an output module configured to output a detection result;
wherein, the detection model is established as follows:
modeling a dynamic behavior expression that jointly represents the user information and the user behavior information of a given event; the user information comprises the characteristics of the user and the user credibility, and the behavior information comprises the behavior type;
since an event is composed of different pieces of information, combining the above dynamic behavior expressions to obtain an event credibility detection expression;
using a time characteristic matrix to obtain the dynamic behavior characteristics of users during information propagation;
generating a user characteristic expression;
and estimating the detection model parameters by a pairwise learning method.
The detection model adopted by the invention brings together the key characteristics that describe a piece of information, namely user information, behavior information, time information and comment information, and models the high-order interactions among them. The vector representation of the modeled microblog or event can therefore be learned more completely and faithfully, and the method adapts better to complex and changeable social network settings. The detection model exploits the power-law distribution of the amount of information over time: following this rule, the continuous time period is divided into time intervals on a $\log_2$ scale, so that each time interval contains the same amount of information and all events share a similar overall time scale. The model can thus learn the expression of events more easily and fully capture the temporal pattern of information distribution. The invention addresses the task of detecting unreal information and its propagation based on dynamic expression learning, and is particularly suited to real, complex social networks with a large amount of information, long time spans, complex semantic scenes, changing user behavior and the like. In such settings, learning the dynamic behavior expression of users yields more accurate predictions.
Detailed Description
The following describes in detail various problems involved in the technical solutions of the present invention with reference to the accompanying drawings. It should be noted that the described embodiments are only intended to facilitate understanding and do not have any limiting effect on the invention.
As shown in fig. 1, the present invention provides a method for detecting unreal information based on dynamic expression learning, which comprises the following steps:
acquiring information to be detected;
detecting the information to be detected by using a pre-established detection model;
outputting a detection result;
wherein, the detection model is established as follows:
step S1, modeling a dynamic behavior expression that jointly represents the user information and the user behavior information of a given event; the user information comprises the characteristics of the user and the user credibility, and the behavior information comprises the behavior type;
step S2, since an event is composed of different pieces of information, combining the dynamic behavior expressions of step S1 to obtain an event credibility detection expression;
step S3, using a time characteristic matrix to obtain the dynamic behavior characteristics of users during information propagation;
step S4, generating a user characteristic expression;
step S5, estimating the detection model parameters by a pairwise learning method.
The Dynamic Behavior Representation Model (DBRM for short) provided by the invention is used for detecting unreal information in a social media scene. By learning implicit expressions, the model builds a dynamic behavior expression that covers user credibility, dynamic attributes, behavior characteristics and comment viewpoints. Aggregating these aspects yields a user behavior expression, and aggregating the dynamic behavior expressions of users yields a credibility expression for the information that an event propagates on social media. In the model, each user is represented by a vector, while the time interval, the user behavior and the user comments are each represented by a matrix. The model further introduces a pairwise learning method so as to maximize the credibility difference between accurate information and unreal information. The DBRM model is built as follows:
1) each user is represented by a vector of the user's own characteristics (such as gender, number of followees and number of followers), which indicates the credibility of the user information;
2) the model incorporates a matrix representation of the time interval between the start of propagation of the information and the release of a microblog, so as to capture the dynamic characteristics of user behavior; user behaviors (such as publishing and forwarding) are represented by implicit operation matrices that distinguish the different behavior characteristics, and user comments are represented according to whether, and to what degree, they question the information;
3) the expression of a piece of information in the propagation process is generated as the product of the above expressions;
4) combining all the dynamic behavior expressions obtained in 3) gives the credibility expression of the event;
5) a pairwise learning method is applied to maximize the difference between accurate and unreal information, so as to detect the credibility of information on social media.
In experiments on the Sina microblog data set, the model achieves more accurate predictions than other existing models.
To better explain the role of the DBRM model in unreal information detection and to verify the effect of the invention, an experiment on the Sina microblog database is described as an example. The experimental data set was divided into a training set (60%), a testing set (30%) and a validation set (10%).
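The 60/30/10 division can be sketched as follows. This is only an illustration: the source does not say how the split was drawn, so splitting at the event level, shuffling, and the `seed` parameter are assumptions, and `split_events` is a hypothetical helper name.

```python
import random


def split_events(event_ids, seed=0):
    """Shuffle event ids and split them 60/30/10 into train/test/validation.

    Assumption: the split is made per event with a seeded shuffle; the
    source only states the proportions.
    """
    ids = list(event_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * 0.6)
    n_test = int(len(ids) * 0.3)
    train = ids[:n_train]
    test = ids[n_train:n_train + n_test]
    validation = ids[n_train + n_test:]  # the remainder, roughly 10%
    return train, test, validation


# 936 is the number of events reported for the data set.
train, test, validation = split_events(range(936))
```

Giving the remainder to the validation set keeps every event in exactly one subset even when the proportions do not divide the event count evenly.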
The experiment used four evaluation indices: accuracy, precision, recall and F1 value. Precision and recall are calculated separately for unreal information and for real information, to show the ability of the model to detect each of the two kinds of information. The larger the values of the four evaluation indices, the stronger the performance of the model.
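As a minimal sketch of these indices, the functions below compute accuracy over all events and precision, recall and F1 for one chosen class. The label strings "rumor" and "real" and the function names are illustrative assumptions, not taken from the source.

```python
def accuracy(y_true, y_pred):
    """Fraction of events whose predicted label equals the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def class_metrics(y_true, y_pred, positive):
    """Precision, recall and F1 when `positive` is treated as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


y_true = ["rumor", "rumor", "real", "real"]
y_pred = ["rumor", "real", "real", "real"]
overall = accuracy(y_true, y_pred)
rumor_scores = class_metrics(y_true, y_pred, "rumor")
real_scores = class_metrics(y_true, y_pred, "real")
```

Calling `class_metrics` once per class gives the separate figures for unreal and real information.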
The specific experimental steps on the microblog data set are as follows:
In step S1, the traditional user information and behavior information are modeled first. The traditional user information comprises the characteristics of the user and the user credibility. The characteristics of the user comprise the gender of the user, the number of followers and the number of followees; the larger the value of the user credibility, the more credible the user. The behavior information comprises the behavior type, such as whether a microblog is originally posted or forwarded; compared with forwarding, original posting carries more weight for credibility detection. Microblogs with high credibility are often posted by users with high credibility, while unreal information is often posted by users with low credibility and then forwarded by users with high credibility.
For the jth microblog $m_j^i$ related to the ith event $e_i$, the user and the user's behavior can be jointly represented by the expression $B_{b_j^i} u_{u_j^i}$, where $u_{u_j^i} \in R^d$ is the vector representation of the user $u_j^i$ of the jth microblog in the ith event, and $R^d$ denotes the d-dimensional real number space. $B_{b_j^i} \in R^{d \times d}$ is the implicit matrix representation of the user behavior $b_j^i$; each of its elements is continuously updated during training, and d is the matrix dimension. This expression gives the characteristics of a user under a particular behavior.
Besides, user comments play an important role in detecting the credibility of information. Users evaluate information according to common knowledge and experience, so unreal information tends to receive more questioning comments. Incorporating all comments $c_j^i$ on the microblog $m_j^i$ extends the expression to $C_{c_j^i} B_{b_j^i} u_{u_j^i}$, where $C_{c_j^i} \in R^{d \times d}$ is the matrix expression of the comments $c_j^i$, which encodes the attitude of the comment authors.
Accordingly, combining the user behavior with the time interval between the moment the rumor starts to propagate and the posting of a particular microblog makes detection more reliable. Adding the time interval $t_j^i$ between the start of event $e_i$ and the posting of the corresponding microblog to the expression gives the dynamic behavior expression of the microblog $m_j^i$:
$$m_j^i = T_{t_j^i} C_{c_j^i} B_{b_j^i} u_{u_j^i}$$
where $T_{t_j^i} \in R^{d \times d}$ is the matrix expression of the time interval $t_j^i$. $m_j^i$ thus represents the combined effect of the four factors (user, behavior, comments and time) on the microblog.
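The composition described above can be sketched in a few lines. This is an illustration under stated assumptions: the dynamic behavior expression is taken to be the product of a time matrix, a comment matrix, a behavior matrix and a user vector, applied in that order, and the names `matvec` and `microblog_expression` as well as the toy 2-dimensional values are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a d x d matrix (list of rows) by a length-d vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def microblog_expression(T, C, B, u):
    """Apply behavior, comment and time matrices to the user vector.

    Assumption: the matrices are applied right to left as T (C (B u)).
    """
    return matvec(T, matvec(C, matvec(B, u)))


# Toy example with d = 2; all values are made up for illustration.
u = [1.0, 2.0]                 # user vector
B = [[1.0, 0.0], [0.0, 1.0]]   # behavior matrix (identity here)
C = [[0.0, 1.0], [1.0, 0.0]]   # comment matrix (swaps the two dimensions)
T = [[2.0, 0.0], [0.0, 2.0]]   # time-interval matrix (scales by 2)
m = microblog_expression(T, C, B, u)
```

In a trained model each matrix would be looked up by the behavior type, the comment attitude and the time interval of the microblog rather than written out by hand.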
In step S2, an event is composed of different microblogs, so the event credibility detection expression is obtained by combining the microblog dynamic behavior expressions of step S1. Let event $e_i$ comprise $n_i$ microblogs, which form the set $M_i$, and let $m_j^i$ denote the dynamic behavior expression of the jth microblog of $e_i$. Averaging over the set gives the expression of event $e_i$:
$$e_i = \frac{1}{n_i} \sum_{m_j^i \in M_i} m_j^i$$
Whether event $e_i$ is unreal information can then be predicted with the expression:
$$y_{e_i} = W^T e_i$$
where $W \in R^d$ is the linear weight of the prediction function and $y_{e_i}$ is the credibility of event $e_i$: the larger the value of $y_{e_i}$, the higher the credibility of $e_i$.
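Step S2 amounts to an average followed by a linear score, which the following sketch illustrates. The function names and the toy numbers are assumptions made for the example; only the averaging and the linear weight W come from the text.

```python
def event_expression(microblog_expressions):
    """Average the d-dimensional expressions of all microblogs of one event."""
    n = len(microblog_expressions)
    d = len(microblog_expressions[0])
    return [sum(m[k] for m in microblog_expressions) / n for k in range(d)]


def credibility(W, e):
    """Linear credibility score: the inner product of weight W and event e."""
    return sum(w * x for w, x in zip(W, e))


# Two toy microblog expressions with d = 2.
e = event_expression([[4.0, 2.0], [2.0, 0.0]])
y = credibility([0.5, -1.0], e)
```

A higher score means a more credible event; turning the score into a rumor or non-rumor decision needs a threshold, which the source does not specify.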
In step S3, the model uses the time feature matrix to capture the dynamic behavior features of users during information propagation. Learning a separate matrix for every moment of a continuous time period would cause data sparseness, so the continuous period is divided into discrete time intervals. Because the dynamic behavior follows a power-law distribution, dividing time into equal intervals is not reasonable. The model therefore divides time on a $\log_2$ scale (base-2 logarithm) and learns matrices only for the upper and lower boundaries of each interval. For a moment inside an interval, the transition matrix is calculated by nonlinear interpolation: the time feature matrix $T_t$ of a moment t is obtained from the matrices learned at the upper and lower boundaries of the time interval that contains t.
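The interval lookup and interpolation can be sketched as below. Two points are assumptions rather than the source's method: the interval boundaries are taken to be powers of two in some fixed time unit with t >= 1, and plain linear interpolation is used as a stand-in because the source's interpolation formula is not recoverable. `log2_bounds` and `interpolate_time_matrix` are hypothetical names.

```python
import math


def log2_bounds(t):
    """Return the (lower, upper) power-of-two boundaries around t, for t >= 1."""
    k = int(math.floor(math.log2(t)))
    return 2 ** k, 2 ** (k + 1)


def interpolate_time_matrix(t, boundary_matrices):
    """Blend the matrices learned at the two boundaries of t's interval.

    Assumption: linear interpolation stands in for the interpolation
    used by the model. `boundary_matrices` maps a boundary value to its
    learned matrix (a list of rows).
    """
    lower, upper = log2_bounds(t)
    w_upper = (t - lower) / (upper - lower)
    w_lower = 1.0 - w_upper
    T_low, T_up = boundary_matrices[lower], boundary_matrices[upper]
    return [
        [w_lower * a + w_upper * b for a, b in zip(row_low, row_up)]
        for row_low, row_up in zip(T_low, T_up)
    ]


# Toy boundary matrices for the interval [4, 8).
matrices = {4: [[1.0, 0.0], [0.0, 1.0]], 8: [[3.0, 0.0], [0.0, 3.0]]}
T_6 = interpolate_time_matrix(6, matrices)
```

Only the boundary matrices are parameters, so the number of matrices to learn grows with the logarithm of the time span rather than with the span itself.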
In step S4, the user expression is generated. In the model, the features and credibility of a user are obtained by learning latent vector expressions. However, each user has only about two behaviors on average, which is too few to learn a separate latent expression for every user. Instead, the model learns expressions of the user's features.
The characteristics of the user may include, for example, gender, number of followees, number of followers, number of microblogs, and whether the user is authenticated. For user u, the feature vector $F_u \in R^f$ combines the encodings of these characteristics. Gender and authentication status are each encoded with two bits of information: for gender, the pattern whose 1st bit is 1 indicates male and the other pattern indicates female; for authentication, one pattern indicates that the user has been authenticated and the other that the user has not. The numbers of followers, followees and microblogs of a user are hard to encode value by value, so these counts are divided into discrete intervals according to a $\log_{10}$ distribution. If user u has $v_u$ followers, the corresponding feature is derived from the interval into which $v_u$ falls, using the lower and upper boundaries of that interval. The features for the number of followees and the number of microblogs are constructed in the same way. Based on the feature vector $F_u$, the user expression is $U_u = S F_u$, where $S \in R^{d \times f}$ is the latent feature expression matrix that is continuously learned during training.
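A possible construction of the feature vector and the user expression is sketched below. Several details are assumptions because the source's formula did not survive: counts are one-hot encoded by their $\log_{10}$ interval (using the number of decimal digits as the interval index), the bit order for the authentication flag is chosen arbitrarily, and `n_bins`, `user_features` and `user_expression` are hypothetical names.

```python
def one_hot(index, size):
    """Vector of `size` zeros with a single 1.0 at `index`."""
    v = [0.0] * size
    v[index] = 1.0
    return v


def log10_bin(count, n_bins):
    """Interval index of an integer count on a log10 scale.

    0 maps to bin 0, 1-9 to bin 1, 10-99 to bin 2, and so on; the digit
    count equals floor(log10(count)) + 1. Large counts share the last bin.
    """
    if count <= 0:
        return 0
    return min(len(str(count)), n_bins - 1)


def user_features(is_male, is_verified, followers, followees, posts, n_bins=8):
    """Concatenate two-bit flags and one-hot log10 bins into F_u."""
    gender = [1.0, 0.0] if is_male else [0.0, 1.0]        # first bit 1 = male
    verified = [1.0, 0.0] if is_verified else [0.0, 1.0]  # bit order assumed
    features = gender + verified
    for count in (followers, followees, posts):
        features += one_hot(log10_bin(count, n_bins), n_bins)
    return features


def user_expression(S, F):
    """User expression U_u = S F_u for a d x f matrix S (list of rows)."""
    return [sum(s * x for s, x in zip(row, F)) for row in S]


F = user_features(True, False, followers=250, followees=30, posts=5)
```

With these settings f = 2 + 2 + 3 * 8 = 28, so S has 28 columns; since users share the columns of S, a user with very few behaviors still receives a meaningful expression.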
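In step S5, a pairwise learning method is used to estimate the model parameters. Because unreal information is hard to collect in quantities sufficient for training, pairwise learning is used to enlarge the training set. Assuming that the credibility of accurate information is higher than that of unreal information, the difference between the two is maximized with the following equation:
$$p(e_n \succ e_r) = g(y_{e_n} - y_{e_r})$$
where $y_{e_n}$ and $y_{e_r}$ are the credibility of the real information $e_n$ and of the unreal information $e_r$ respectively, and g(x) is the nonlinear function $g(x) = 1/(1 + e^{-x})$. Taking the negative log-likelihood gives the objective function:
$$J = \sum_{e_n, e_r \in E} \ln\left(1 + e^{-(y_{e_n} - y_{e_r})}\right) + \frac{\lambda}{2} \|\Theta\|^2$$
where E is the set of all events, the sum runs over pairs whose labels $l_{e_n}$ and $l_{e_r}$ mark real and unreal information respectively, $\Theta = \{U, B, C, T, W\}$ denotes all parameters to be estimated, and $\lambda$ is the parameter that controls the strength of the regularization. The derivatives of J with respect to W, $e_n$ and $e_r$ are as follows:
$$\frac{\partial J}{\partial W} = -\sum_{e_n, e_r \in E} \delta_{n,r} (e_n - e_r) + \lambda W$$
$$\frac{\partial J}{\partial e_n} = -\sum_{e_r} \delta_{n,r} W, \qquad \frac{\partial J}{\partial e_r} = \sum_{e_n} \delta_{n,r} W$$
where $\delta_{n,r} = 1 - g(y_{e_n} - y_{e_r})$.
Once the derivative $\partial J / \partial e_i$ of an event has been calculated, the gradient with respect to each of its microblog expressions follows from the averaging in step S2:
$$\frac{\partial J}{\partial m_j^i} = \frac{1}{n_i} \frac{\partial J}{\partial e_i}$$
where $n_i$ is the number of microblogs of $e_i$. From this, the gradients of the parameters U, B, C and T are obtained by the chain rule, together with their regularization terms.
After all gradients are calculated, the model parameters are updated by stochastic gradient descent. The above process is repeated until the model converges.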
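The pairwise estimation can be illustrated with a deliberately reduced sketch. It assumes a logistic pairwise loss on the credibility difference with L2 regularization, keeps the event expressions fixed and updates only the weight vector W, whereas the model described here also updates the user, behavior, comment and time parameters. The learning rate, epoch count and function names are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def pairwise_loss(W, real, unreal, lam):
    """Negative log-likelihood over all (real, unreal) pairs plus L2 penalty."""
    loss = sum(
        math.log(1.0 + math.exp(-(dot(W, en) - dot(W, er))))
        for en in real
        for er in unreal
    )
    return loss + 0.5 * lam * dot(W, W)


def sgd_pairwise(real, unreal, d, lam=0.01, lr=0.1, epochs=50, seed=0):
    """Learn W by stochastic gradient descent over event pairs.

    Simplification: event expressions are fixed inputs, and the L2 term
    is applied at every pair update.
    """
    rng = random.Random(seed)
    W = [0.0] * d
    pairs = [(en, er) for en in real for er in unreal]
    for _ in range(epochs):
        rng.shuffle(pairs)
        for en, er in pairs:
            delta = dot(W, en) - dot(W, er)
            coeff = -(1.0 - sigmoid(delta))  # derivative of ln(1 + e^-x) at delta
            W = [
                w - lr * (coeff * (a - b) + lam * w)
                for w, a, b in zip(W, en, er)
            ]
    return W


# Toy event expressions with d = 2.
real_events = [[1.0, 0.2], [0.9, 0.1]]
unreal_events = [[0.1, 0.8], [0.2, 1.0]]
W = sgd_pairwise(real_events, unreal_events, d=2)
```

Pairing every real event with every unreal event turns N labelled events into up to N_real x N_unreal training examples, which is how the pairwise formulation enlarges the training set.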
The PR curves of the various methods on the rumor and non-rumor data sets are shown in FIGS. 3(a) and 3(b), respectively.
Table 1 below gives the statistics of the data set:
TABLE 1
| Element | Events | Rumors | True information | Microblogs | Original microblogs | Forwarded microblogs | Users |
| Number | 936 | 500 | 436 | 630363 | 98429 | 532236 | 321246 |
Table 2 below shows the experimental comparison between this model and the current state-of-the-art models:
TABLE 2
The above-described embodiments further explain the objects, technical solutions, and effects of the present invention in detail. It should be understood that the above-mentioned embodiments are merely exemplary of the present invention, and not restrictive, and that any modifications, equivalents, improvements, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.