CN106910013A

CN106910013A - Unreal information detection method and device based on dynamic expression learning

Info

Publication number: CN106910013A
Application number: CN201710085225.5A
Authority: CN
Inventors: 谭铁牛; 王亮; 吴书; 刘强; 余峰
Original assignee: Institute of Automation of Chinese Academy of Science
Current assignee: Institute of Automation of Chinese Academy of Science
Priority date: 2017-02-16
Filing date: 2017-02-16
Publication date: 2017-06-30

Abstract

The invention discloses a false information detection method based on dynamic expression learning, comprising the following steps: acquiring information to be detected; detecting the information to be detected with a pre-established detection model; and outputting the detection result. The detection model is established as follows: step S1, modeling a dynamic behavior expression that jointly represents the user information and the user behavior information for a given event, where the user information includes the user's features and the user's credibility, and the behavior information includes the behavior type; step S2, since an event is composed of different pieces of information, combining the dynamic behavior expressions from step S1 to obtain the event credibility detection expression; step S3, using a time feature matrix to capture the user's dynamic behavior features during information dissemination; step S4, generating the user feature expression; step S5, estimating the detection model parameters with a pairwise learning method.

Description

Unreal information detection method and device based on dynamic expression learning

Technical Field

The invention relates to the technical field of computer model detection, in particular to an unreal information detection method and device based on dynamic expression learning.

Background

The rapid development of social media has given network users unprecedented convenience. Social media such as Facebook, Twitter and Sina Weibo offer platforms on which users can share information and publish their personal stories. At the same time, the spread of unreal information on these platforms causes great trouble for users and harms social harmony and public safety. In recent years, information credibility detection has attracted great attention in both academia and industry.

The information considered by existing methods falls mainly into the following categories: text information, source credibility information, dynamic information and comment information. The factors used to evaluate the credibility of user behavior are mainly time, people, behavior and manner. Based on this information, existing methods rely mainly on hand-crafted features, which are cumbersome to design and cannot capture the underlying characteristics of the data. They also cannot model the correlations among different information types and different credibility factors during information propagation. The fact discovery method is an unsupervised or semi-supervised method for discovering facts and detecting information credibility in conflicting data. It is based mainly on source credibility information, and the detected credibility is aggregated from each source. However, the fact discovery method suits only idealized, topic-specific cases such as price prediction and flight prediction, and does not suit an environment as complex as social media.

In recent years, many methods that automatically measure information credibility in social media have come into wide use. These methods are based primarily on textual information and source credibility information at the message level or the event level, and some studies take both levels into account. With respect to dynamic information, some studies define temporal features of the propagation process or train models with different temporal features. For comment information, studies use user feedback or microblog flagging to indicate suspicious information. Although these methods are widely used, they require complicated feature engineering and cannot capture the underlying characteristics of the data. They also cannot model the correlations among different information types and different credibility factors during information propagation.

The DBRM model aims to determine whether an event is unreal information from what users post and forward on social media. The model uses the following user behavior factors to judge the credibility of a message: user credibility, the time interval since the event occurred, the user's posting and forwarding behaviors, and the user's comments. The model introduces a representation learning method which, unlike conventional feature engineering, can capture information about different aspects of the propagation process. The model learns implicit representations of the user, the dynamic time interval, the user behavior and the comment attitude. Based on these implicit representations, the model generates a dynamic behavior representation of the information, which is the innovation it brings to credibility detection.

Disclosure of Invention

In view of the technical shortcomings of traditional methods based on hand-crafted features, the invention provides a detection method and a detection device based on dynamic behavior expression, in order to better detect information credibility.

According to an aspect of the present invention, a method for detecting unreal information based on dynamic expression learning is provided, which includes the following steps:

acquiring information to be detected;

detecting the information to be detected by using a pre-established detection model;

outputting a detection result;

wherein, the detection model is established as follows:

step S1, modeling a dynamic behavior expression that jointly represents the user information and the user behavior information for a given event; the user information comprises the features of the user and the user credibility, and the behavior information comprises the behavior type;

step S2, an event being composed of different pieces of information, combining the dynamic behavior expression in step S1 to finally obtain an event credibility detection expression;

step S3, using the time feature matrix to obtain the user's dynamic behavior features in the information propagation process;

step S4, generating a user feature expression;

step S5, estimating the detection model parameters by a pairwise learning method.

According to a second aspect of the present invention, there is provided an unreal information detecting apparatus based on dynamic expression learning, including:

the acquisition module is configured to acquire information to be detected;

the detection module is configured to detect the information to be detected by utilizing a pre-established detection model;

an output module configured to output a detection result;

wherein, the detection model is established as follows:

firstly, modeling a dynamic behavior expression that jointly represents the user information and the user behavior information for a given event; the user information comprises the features of the user and the user credibility, and the behavior information comprises the behavior type;

an event being composed of different pieces of information, combining the dynamic behavior expression modeled above to finally obtain an event credibility detection expression;

using the time feature matrix to obtain the user's dynamic behavior features in the information propagation process;

generating a user feature expression;

and estimating the detection model parameters by a pairwise learning method.

The detection model adopted by the invention brings together the key features that characterize information, namely user information, behavior information, time information and comment information, and models the high-order interactions among these features. The vector representation of a modeled microblog or event can therefore be learned more completely and faithfully, which makes the method better suited to complex and changeable social-network settings. The detection model reveals the power-law distribution of the amount of information over time and, following this distribution, uses log₂ to divide the continuous time period into different time intervals, so that each time interval contains the same amount of information and all events share a similar time scale overall. The model can thus learn the expression of events more easily and fully exploit the temporal pattern of information distribution. The invention addresses unreal information detection based on dynamic expression learning, in particular in real, complex social networks with a large amount of information, long time spans, complex semantic scenes and changing user behavior. Learning the dynamic behavior expression of users yields more accurate predictions.

Drawings

FIG. 1 is a flow chart of a method for detecting unreal information based on dynamic expression learning according to the present invention;

FIG. 2 is a schematic diagram of the expression learning process of the dynamic behavior expression model DBRM in the present invention;

FIGS. 3(a) and 3(b) are precision-recall curves of the different comparison methods for rumors (a) and real information (b).

Detailed Description

The following describes in detail various problems involved in the technical solutions of the present invention with reference to the accompanying drawings. It should be noted that the described embodiments are only intended to facilitate understanding and do not have any limiting effect on the invention.

As shown in fig. 1, the present invention provides a method for detecting unreal information based on dynamic expression learning, which comprises the following steps:

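acquiring information to be detected;

detecting the information to be detected by using a pre-established detection model;

outputting a detection result;

wherein, the detection model is established as follows:

step S1, modeling a dynamic behavior expression that jointly represents the user information and the user behavior information for a given event; the user information comprises the features of the user and the user credibility, and the behavior information comprises the behavior type;

step S2, an event being composed of different pieces of information, combining the dynamic behavior expression in step S1 to finally obtain an event credibility detection expression;

step S3, using the time feature matrix to obtain the user's dynamic behavior features in the information propagation process;

step S4, generating a user feature expression;

step S5, estimating the detection model parameters by a pairwise learning method.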

The Dynamic Behavior Expression Model (DBRM for short) provided by the invention is used for detecting unreal information in social media. By learning implicit expressions, the model captures user credibility, dynamic attributes, behavior features and comment attitudes. Combining these aspects yields a user behavior expression, and combining the dynamic behavior expressions of all users yields a credibility expression for the information propagated by an event on social media. In the model, each user is represented by a vector, while time intervals, user behaviors and user comments are each represented by a matrix. The model further introduces a pairwise learning method so as to maximize the credibility difference between accurate information and unreal information. The DBRM model is built as follows: 1) each user is represented by a vector built from the user's own features (such as gender, number of accounts followed and number of followers) to indicate the credibility of the user; 2) the model incorporates a matrix representation of the time interval from the start of the event's propagation to the posting of the microblog, so as to capture the dynamic characteristics of user behavior, and represents user behaviors (such as posting and forwarding) and user comments by implicit operation matrices, which capture the different behavior features and whether a comment expresses doubt; 3) the expression of a piece of information in the propagation process is generated as the product of the expressions above; 4) combining all the dynamic behavior expressions from 3) gives the credibility expression of the event; 5) a pairwise learning method is applied to maximize the difference between accurate and unreal information, so as to detect the credibility of information on social media. In experiments on the Sina Weibo data set, the model predicts more accurately than other existing models.

To better explain the role of the DBRM model in unreal information detection and to verify the effect of the invention, an experiment on the Sina Weibo data set is taken as an example. The experimental data set was divided into a training set (60%), a testing set (30%) and a validation set (10%).
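The 60/30/10 division can be sketched as follows. This is only an illustration: the function name `split_events`, the fixed random seed and the choice to split at the event level are assumptions, since the text does not say how the split was drawn.

```python
import random


def split_events(event_ids, train_frac=0.6, test_frac=0.3, seed=0):
    """Shuffle event ids and split them into train / test / validation lists.

    Assumption: the split is made per event; whatever is left after the
    training and testing shares becomes the validation set (about 10%).
    """
    ids = list(event_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * train_frac)
    n_test = int(len(ids) * test_frac)
    train = ids[:n_train]
    test = ids[n_train:n_train + n_test]
    validation = ids[n_train + n_test:]
    return train, test, validation


# 936 is the number of events reported for the data set.
train, test, validation = split_events(range(936))
```

With 936 events this gives 561 training, 280 testing and 95 validation events, because the fractional shares are rounded down.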

The experiment uses four evaluation indices: accuracy, precision, recall and F1 value. Precision and recall are calculated separately for unreal information and for real information, to show the model's ability to detect each kind of information. The larger the values of the four indices, the better the model performs.
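As a minimal sketch of how these four indices can be computed for one class at a time, the following uses the standard definitions. The function name `evaluate` and the string labels `"rumor"` and `"real"` are illustrative assumptions, not names from the source.

```python
def evaluate(y_true, y_pred, positive):
    """Accuracy plus precision, recall and F1 for the class `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels, invented for illustration only.
y_true = ["rumor", "rumor", "rumor", "real", "real", "real"]
y_pred = ["rumor", "rumor", "real", "real", "real", "rumor"]
rumor_scores = evaluate(y_true, y_pred, positive="rumor")
real_scores = evaluate(y_true, y_pred, positive="real")
```

Calling `evaluate` once with `positive="rumor"` and once with `positive="real"` gives the per-class precision and recall described above.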

The specific experimental steps on the microblog data set are as follows:

In step S1, the traditional user information and behavior information are modeled first. The traditional user information comprises the features of the user and the user credibility. The features of the user include the user's gender, the number of accounts the user follows and the number of the user's followers; the larger the value of the user credibility, the more credible the user. The behavior information comprises the behavior type, such as whether a microblog is originally posted or forwarded; compared with forwarding, original posting is more important for credibility detection. Microblogs with high credibility are often originally posted by users with high credibility, while unreal information is often originally posted by users with low credibility and then forwarded by users with high credibility.

For the $j$-th microblog $m_j^{e_i}$ related to the $i$-th event $e_i$, the expression that jointly represents the user and the user's behavior is $B_{b_j^{e_i}} U_{u_j^{e_i}}$, where $U_{u_j^{e_i}} \in R^d$ is the vector representation of the user $u_j^{e_i}$ of the $j$-th microblog in the $i$-th event, and $R^d$ denotes the $d$-dimensional real space. $B_{b_j^{e_i}} \in R^{d \times d}$ is the implicit matrix representation of the user behavior $b_j^{e_i}$; each of its elements is updated continuously during training, and $d$ is the matrix dimension. This expression gives the features of the user under a particular behavior.

In addition, user comments play an important role in detecting the credibility of information. Users evaluate information according to common sense and experience, and unreal information tends to receive more questioning comments. Incorporating all the comments $c_j^{e_i}$ on microblog $m_j^{e_i}$ gives the expression $C_{c_j^{e_i}} B_{b_j^{e_i}} U_{u_j^{e_i}}$, where $C_{c_j^{e_i}} \in R^{d \times d}$ is the matrix expression of the comments $c_j^{e_i}$, which adds the comment attitude of the commenters.

Likewise, combining user behavior with the time interval from when the event starts to propagate to when a particular microblog is posted allows credibility to be detected better. Adding the time interval $t_j^{e_i}$, measured from the start of event $e_i$ to the posting of the corresponding microblog, gives the dynamic behavior expression of microblog $m_j^{e_i}$:

$$R_j^{e_i} = T_{t_j^{e_i}} \, C_{c_j^{e_i}} \, B_{b_j^{e_i}} \, U_{u_j^{e_i}}$$

where $T_{t_j^{e_i}} \in R^{d \times d}$ is the matrix expression of the time interval $t_j^{e_i}$. $R_j^{e_i}$ represents the joint effect of the four factors (user, behavior, comments and time) on microblog $m_j^{e_i}$.
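The dynamic behavior expression of a microblog is the product of the time, comment and behavior matrices with the user vector. A minimal sketch in plain Python follows. The helper names `matvec` and `dynamic_behavior_expression` and the toy 2-dimensional values are assumptions made for illustration; in the model these matrices and vectors are learned.

```python
def matvec(M, v):
    """Multiply a d x d matrix (list of rows) by a length-d vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def dynamic_behavior_expression(T, C, B, U):
    """Return R = T C B U by applying B, then C, then T to the user vector."""
    return matvec(T, matvec(C, matvec(B, U)))


# Toy values with d = 2, invented for illustration only.
U = [1, 2]                # user vector
B = [[1, 0], [0, 2]]      # behavior matrix (e.g. "forward")
C = [[0, 1], [1, 0]]      # comment matrix
T = [[2, 0], [0, 1]]      # time-interval matrix
R_j = dynamic_behavior_expression(T, C, B, U)
```

Applying the matrices right to left keeps every intermediate result a vector, so no matrix-matrix product is needed.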

In step S2, an event is composed of different microblogs, and the event credibility detection expression is obtained by combining the microblog dynamic behavior expressions from step S1. Let event $e_i$ comprise $n^{e_i}$ microblogs, which together form the set $\{m_1^{e_i}, \ldots, m_{n^{e_i}}^{e_i}\}$. Taking the average gives the expression of event $e_i$:

$$R^{e_i} = \frac{1}{n^{e_i}} \sum_{j=1}^{n^{e_i}} R_j^{e_i}$$

Whether event $e_i$ is unreal information is predicted with the expression:

$$y^{e_i} = W^{T} R^{e_i}$$

where $W \in R^d$ is the linear weight of the prediction function and $y^{e_i}$ represents the credibility of event $e_i$. The larger the value of $y^{e_i}$, the higher the credibility of event $e_i$.
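The averaging and linear scoring in step S2 can be sketched as below. The function names `event_expression` and `credibility` and the toy numbers are illustrative assumptions; the weight vector would normally come from training.

```python
def event_expression(microblog_exprs):
    """Average the dynamic behavior expressions of an event's microblogs."""
    n = len(microblog_exprs)
    d = len(microblog_exprs[0])
    return [sum(r[k] for r in microblog_exprs) / n for k in range(d)]


def credibility(W, R_event):
    """Linear credibility score: the dot product of W and the event expression."""
    return sum(w * r for w, r in zip(W, R_event))


# Two toy microblog expressions for one event (d = 2), invented for illustration.
R_event = event_expression([[8, 1], [2, 3]])
W = [0.5, -1.0]
score = credibility(W, R_event)
```

A higher score means the event is judged more credible; only the relative order of scores matters for the pairwise training in step S5.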

In step S3, the model uses the time feature matrix to capture the user's dynamic behavior features during information dissemination. Learning a separate matrix for every point in a continuous time period would cause data sparsity, so the continuous time period is divided into different time intervals. Because the dynamic behavior follows a power-law distribution, dividing time into equal-length intervals is not reasonable. The model therefore divides the time intervals according to log₂ (the base-2 logarithm) and learns only the matrices corresponding to the upper and lower boundaries of each time interval. For a given moment within a time interval, the transition matrix is calculated by nonlinear interpolation: the time feature matrix $T_t$ for time $t$ is computed from the matrices at the upper and lower boundaries of the interval containing $t$.
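The log₂ partition and the boundary-based interpolation can be sketched as follows. Several things here are assumptions, because the interpolation formula itself is not legible in this text: time is taken to be measured in hours with t ≥ 1, interval boundaries are taken to be powers of two, and the interpolation weight is taken to be linear in log₂(t), which makes it nonlinear in t. The function names are hypothetical.

```python
import math


def interval_bounds(t):
    """Return the powers of two (lower, upper) with lower <= t < upper."""
    if t < 1:
        raise ValueError("this sketch assumes t >= 1")
    k = int(math.floor(math.log2(t)))
    return 2 ** k, 2 ** (k + 1)


def interpolate_time_matrix(t, boundary_matrices):
    """Blend the matrices stored at the two boundaries of t's interval.

    Assumed rule: the weight grows linearly in log2(t) from 0 at the lower
    boundary to 1 at the upper boundary.
    """
    lower, upper = interval_bounds(t)
    w = math.log2(t) - math.log2(lower)
    T_lo, T_hi = boundary_matrices[lower], boundary_matrices[upper]
    return [[(1 - w) * a + w * b for a, b in zip(row_lo, row_hi)]
            for row_lo, row_hi in zip(T_lo, T_hi)]


# Toy boundary matrices for the interval [4, 8), invented for illustration.
boundaries = {4: [[1.0, 0.0], [0.0, 1.0]],
              8: [[3.0, 0.0], [0.0, 3.0]]}
T_at_4 = interpolate_time_matrix(4, boundaries)
T_at_6 = interpolate_time_matrix(6, boundaries)
```

Only the boundary matrices are stored, so the number of time matrices grows with the logarithm of the longest time span rather than with the number of distinct timestamps.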

In step S4, the user expression is generated. In the model, the user's features and credibility are captured by learning latent vector expressions. On average each user has only two behaviors in the data, which is too few to learn a separate latent expression for every user. Instead, the user expression is learned from the user's features.

The features of a user may include, for example, gender, the number of accounts followed, the number of followers, the number of microblogs and whether the user is verified. For user $u$, the feature vector is $F_u \in R^f$. Gender and verification status are each encoded as two bits of information: for gender, one code (the one whose first bit is 1) indicates male and the other indicates female; for verification, one code indicates that the user has been verified and the other that the user has not. The number of accounts followed, the number of followers and the number of microblogs are counts, and giving every distinct value its own expression is impractical, so these counts are divided into discrete intervals according to a log₁₀ distribution. If user $u$ has $v_u$ followers, the corresponding feature is derived from the interval into which $v_u$ falls, where $I$ denotes the interval boundaries. The features for the number of accounts followed and the number of microblogs are constructed in the same way. Based on the feature vector $F_u$, the user expression is $U_u = S F_u$, where $S \in R^{d \times f}$ is a latent feature-expression matrix that is learned continuously during training.
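One way to build such a feature vector and project it to a user expression is sketched below. The layout of the vector, the one-hot encoding of the log₁₀ interval, the number of intervals and all function names are assumptions chosen for illustration; the text only states that two-bit codes and log₁₀ intervals are used and that the user expression is the product of S and F_u.

```python
def log10_bucket(count, n_buckets):
    """Index of the log10 interval of a non-negative integer count.

    0 for 0-9, 1 for 10-99, 2 for 100-999, and so on, capped at the last
    bucket. The digit count is used to avoid floating-point rounding.
    """
    return min(len(str(int(count))) - 1, n_buckets - 1)


def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]


def user_features(is_male, is_verified, followers, followees, microblogs,
                  n_buckets=8):
    """Assumed layout: gender bits, verification bits, then three one-hot blocks."""
    gender = [1.0, 0.0] if is_male else [0.0, 1.0]
    verified = [1.0, 0.0] if is_verified else [0.0, 1.0]
    return (gender + verified
            + one_hot(log10_bucket(followers, n_buckets), n_buckets)
            + one_hot(log10_bucket(followees, n_buckets), n_buckets)
            + one_hot(log10_bucket(microblogs, n_buckets), n_buckets))


def user_expression(S, F_u):
    """User expression: the d x f matrix S applied to the feature vector F_u."""
    return [sum(s * f for s, f in zip(row, F_u)) for row in S]


F_u = user_features(True, False, followers=1500, followees=80, microblogs=7,
                    n_buckets=4)
# Toy S with d = 2 and f = 16: first row all ones, second row 0, 1, ..., 15.
S = [[1.0] * 16, [float(i) for i in range(16)]]
U_u = user_expression(S, F_u)
```

Because the expression depends only on features, two users with the same features share the same vector, which is what lets the model cope with users who appear only once or twice.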

In step S5, a pairwise learning method is used to estimate the model parameters. Because unreal information is hard to collect for model training, pairwise learning is used to enlarge the training set. Assuming that the credibility of accurate information is higher than that of unreal information, the difference between the two is maximized with the following equation:

$$p(e_n > e_r) = g\left(y^{e_n} - y^{e_r}\right)$$

where $y^{e_n}$ and $y^{e_r}$ respectively represent the credibility of real information $e_n$ and unreal information $e_r$, and $g(x)$ is the nonlinear function $g(x) = 1/(1 + e^{-x})$. Combining this with the negative log-likelihood, the objective function can be written as:

$$J = \sum_{\{e_n, e_r\} \in E,\; l_{e_n} = 1,\; l_{e_r} = 0} \ln\left(1 + e^{-W^{T}\left(R^{e_n} - R^{e_r}\right)}\right) + \frac{\lambda}{2} \lVert \Theta \rVert^{2}$$

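where $E$ represents the set of all events, $l_{e_n}$ and $l_{e_r}$ are the labels of real and unreal information respectively, $\Theta = \{U, B, C, T, W\}$ denotes all the parameters to be estimated, and $\lambda$ is the parameter that controls the strength of regularization. For a training pair $(e_n, e_r)$, the derivatives of $J$ with respect to $W$, $R^{e_n}$ and $R^{e_r}$ are:

$$\frac{\partial J}{\partial W} = -\,l(e_n, e_r)\left(R^{e_n} - R^{e_r}\right) + \lambda W, \qquad \frac{\partial J}{\partial R^{e_n}} = -\,l(e_n, e_r)\,W, \qquad \frac{\partial J}{\partial R^{e_r}} = l(e_n, e_r)\,W$$

where

$$l(e_n, e_r) = \frac{e^{-W^{T}\left(R^{e_n} - R^{e_r}\right)}}{1 + e^{-W^{T}\left(R^{e_n} - R^{e_r}\right)}}$$

Given the derivative $\partial J / \partial R^{e_i}$ for an event $e_i$ containing $n^{e_i}$ microblogs, and because the event expression is the average of its microblog expressions, the derivative with respect to each microblog expression is:

$$\frac{\partial J}{\partial R_j^{e_i}} = \frac{1}{n^{e_i}} \frac{\partial J}{\partial R^{e_i}}$$

From this, the gradients of the corresponding parameters follow by the chain rule applied to $R_j^{e_i} = T\,C\,B\,U$ (writing $T$, $C$, $B$, $U$ for the time, comment and behavior matrices and the user vector of microblog $j$), in each case plus the regularization term $\lambda$ times the parameter:

$$\frac{\partial J}{\partial U} = (T C B)^{T} \frac{\partial J}{\partial R_j^{e_i}}, \qquad \frac{\partial J}{\partial B} = (T C)^{T} \frac{\partial J}{\partial R_j^{e_i}}\, U^{T}, \qquad \frac{\partial J}{\partial C} = T^{T} \frac{\partial J}{\partial R_j^{e_i}}\, (B U)^{T}, \qquad \frac{\partial J}{\partial T} = \frac{\partial J}{\partial R_j^{e_i}}\, (C B U)^{T}$$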

After all gradients have been calculated, the model parameters are updated using stochastic gradient descent. The above process is repeated until the model converges.
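A single stochastic gradient step on the pairwise objective can be sketched as follows. This is a deliberately reduced illustration: only the weight vector W is updated, the two event expressions are treated as fixed inputs, and the regularization is applied to W alone. The function names, learning rate and toy values are assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def pair_loss(W, R_n, R_r, lam):
    """ln(1 + exp(-W.(R_n - R_r))) plus L2 regularization on W only."""
    diff = sum(w * (a - b) for w, a, b in zip(W, R_n, R_r))
    reg = 0.5 * lam * sum(w * w for w in W)
    return math.log(1.0 + math.exp(-diff)) + reg


def sgd_step_W(W, R_n, R_r, lam, lr):
    """One gradient step on W for a (real, unreal) pair of event expressions."""
    diff = sum(w * (a - b) for w, a, b in zip(W, R_n, R_r))
    l = 1.0 - sigmoid(diff)  # equals exp(-diff) / (1 + exp(-diff))
    grad = [-l * (a - b) + lam * w for w, a, b in zip(W, R_n, R_r)]
    return [w - lr * g for w, g in zip(W, grad)]


# Toy pair, invented for illustration: R_n is the real event, R_r the unreal one.
W0 = [0.0, 0.0]
R_n, R_r = [1.0, 0.0], [0.0, 1.0]
loss_before = pair_loss(W0, R_n, R_r, lam=0.01)
W1 = sgd_step_W(W0, R_n, R_r, lam=0.01, lr=0.1)
loss_after = pair_loss(W1, R_n, R_r, lam=0.01)
```

In the full model the same pair would also send gradients back through the event expressions to the time, comment and behavior matrices and the user feature matrix.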

The PR curves of the various methods on the rumor and non-rumor data sets are shown in FIGS. 3(a) and 3(b), respectively.

Table 1 below gives the statistics of the data set.

TABLE 1

Element	Events	Rumors	True information	Microblogs	Original microblogs	Forwarded microblogs	Users
Number	936	500	436	630363	98429	532236	321246

Table 2 below shows the experimental comparison of this model with current state-of-the-art models:

TABLE 2

The above-described embodiments further explain the objects, technical solutions, and effects of the present invention in detail. It should be understood that the above-mentioned embodiments are merely exemplary of the present invention, and not restrictive, and that any modifications, equivalents, improvements, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A false information detection method based on dynamic expression learning, comprising the following steps:

acquiring information to be detected;

detecting the information to be detected by using a pre-established detection model;

outputting a detection result;

wherein the detection model is established as follows:

step S1, modeling a dynamic behavior expression that jointly represents the user information and the user behavior information for a given event; the user information includes user features and user credibility, and the behavior information includes the behavior type;

step S2, an event being composed of different pieces of information, combining the dynamic behavior expression in step S1 to finally obtain the event credibility detection expression;

step S3, using the time feature matrix to obtain the user's dynamic behavior features during the information dissemination process;

step S4, generating a user feature expression;

step S5, estimating the parameters of the detection model by using a pairwise learning method.

2. The method according to claim 1, wherein the information is microblog information; the user information includes user features and user credibility, and the user features include the user's gender, the number of accounts the user follows and the number of the user's followers; the larger the value of user credibility, the more credible the user; and the behavior information includes whether a microblog is an original microblog or a forwarded microblog.

3. The method according to claim 1, wherein the dynamic behavior expression is as follows:

$$R_j^{e_i} = T_{t_j^{e_i}} \, C_{c_j^{e_i}} \, B_{b_j^{e_i}} \, U_{u_j^{e_i}},$$

where $U_{u_j^{e_i}} \in R^d$ is the vector representation of the user $u_j^{e_i}$ of the $j$-th microblog in the $i$-th event $e_i$; $R^d$ represents the $d$-dimensional real space; $B_{b_j^{e_i}}$ is the implicit matrix representation of the user behavior $b_j^{e_i}$; $C_{c_j^{e_i}}$ is the matrix expression of the comments $c_j^{e_i}$; and $T_{t_j^{e_i}}$ is the matrix expression of the time interval $t_j^{e_i}$.

4. The method according to claim 1, wherein the event reliability detection expression is as follows:

$$y^{e_i} = W^{T} R^{e_i}$$

where $y^{e_i}$ represents the credibility of event $e_i$, $W \in R^d$ is the linear weight of the prediction function, and $R^{e_i}$ is the expression of event $e_i$.

5. The method according to claim 1, characterized in that, in step S3, the continuous time period is divided into different time intervals according to log₂, only the matrices corresponding to the upper and lower boundaries of each time interval are learned, and for a given moment within a time interval, the time feature matrix (a transition matrix) is calculated by nonlinear interpolation.

6. The method according to claim 1, characterized in that in step S5, the difference between real information and false information is distinguished by the following expression:

$$p(e_n > e_r) = g\left(y^{e_n} - y^{e_r}\right)$$

where $y^{e_n}$ and $y^{e_r}$ respectively represent the credibility of real information $e_n$ and false information $e_r$, and $g(x)$ is the nonlinear function $g(x) = 1/(1 + e^{-x})$;

the objective function of the detection model is expressed as follows:

$$J = \sum_{\{e_n, e_r\} \in E,\; l_{e_n} = 1,\; l_{e_r} = 0} \ln\left(1 + e^{-W^{T}\left(R^{e_n} - R^{e_r}\right)}\right) + \frac{\lambda}{2} \lVert \Theta \rVert^{2},$$

where $E$ represents the set of all events, $l_{e_n}$ and $l_{e_r}$ represent the labels of real information and false information respectively, $\Theta$ represents all the detection model parameters to be estimated, $R^{e_n}$ and $R^{e_r}$ respectively represent the dynamic behavior expressions of real information $e_n$ and false information $e_r$, $W \in R^d$ is the linear weight of the prediction function, and $\lambda$ is the parameter controlling the strength of regularization.

7. The method according to claim 1, characterized in that the detection model brings together the features that characterize the key properties of information, namely user information, behavior information, time information and comment information, and models the high-order interactive expression among these features.

8. The method according to claim 1, wherein the detection model reveals the power-law distribution of the amount of information over time and, according to this distribution, uses log₂ to divide the continuous time period into different time intervals, which ensures both that each time interval contains the same number of messages and that all events share a similar time scale as a whole.

9. A false information detection device based on dynamic expression learning, comprising:

an acquisition module configured to acquire information to be detected;

a detection module configured to detect the information to be detected by using a pre-established detection model;

an output module configured to output a detection result;

wherein the detection model is established as follows:

first, modeling a dynamic behavior expression that jointly represents the user information and the user behavior information for a given event; the user information includes user features and user credibility, and the behavior information includes the behavior type;

an event being composed of different pieces of information, combining the dynamic behavior expression modeled above to finally obtain the event credibility detection expression;

using the time feature matrix to obtain the dynamic behavior features of users during information dissemination;

generating a user feature expression;

estimating the detection model parameters by using a pairwise learning method.