Detection of fake news online is important in today's society because fresh news content is produced rapidly thanks to the abundance of available technology. Claire Wardle has identified seven main categories of fake news, and within each category the fake news content can be visual and/or linguistic-based. To detect fake news, both linguistic and non-linguistic cues can be analysed using several methods. While many of these detection methods are generally successful, they have some limitations.
With the advancement of technology, digital news reaches users around the world and contributes to the growing spread of hoaxes and disinformation online. Fake news circulates on popular platforms such as social media and the wider Internet. There have been multiple detection efforts and solutions, some of which use artificial intelligence tools. However, fake news is written to convince readers of false information, which makes such articles difficult to recognise. Digital news is also produced at high volume and speed, every second of every day, so it is challenging for machine learning systems to detect fake news effectively.[1]
If fake news could not be detected, truth would lose its value. Fake news paves the way for deceiving others and promoting ideologies, and its producers can earn money from the interactions their publications attract. Disinformation is spread with various intentions: to gain favour in political elections, to promote businesses and products, or out of spite or revenge. Humans can be gullible, and fake news is difficult to distinguish from ordinary news. Many people are easily influenced, especially by content shared by friends and family whom they trust. Readers also tend to respond to news emotionally and accept it readily when it is relevant to them and aligns with their existing beliefs, so they are satisfied with hearing what they want to hear and fall into these traps.[2]
Fake news appears in different forms, including clickbait, propaganda, satire or parody, sloppy journalism, misleading headlines and biased or slanted news. Claire Wardle of First Draft News has identified seven types of fake news.[3]
Type of fake news | Description |
---|---|
Satire or parody | Satire or parody has the potential to fool and may be misinterpreted as fact. Satire does not necessarily cause harm; it takes stories from news sources and uses ridicule and sarcasm. Parodies focus on the content itself and are explicitly produced for entertainment purposes.[4] |
False connection | A false connection occurs when headlines, visuals or captions do not support the content. This kind of news is built on poor journalism, with unrelated attributes used to attract attention and generate profit; for example, a headline announcing the death of a celebrity whose article never mentions that celebrity. |
Misleading content | Misleading content uses information to frame an issue or an individual. It is a popular form of news used by politicians to bring down their opponents by making false claims that may contain some truth. |
False context | False context is genuine content shared along with false contextual information. |
Impostor content | Impostor content derives from a false or made-up source that impersonates a real news source. |
Manipulated content | Manipulated content presents genuine information or imagery that has been altered to tell a different story. |
Fabricated content | Fabricated content is new, fully made-up content that is 100% false and intended to deceive and cause harm. |
Visual-based fake news uses content that integrates multiple forms of media, including graphical representations such as Photoshopped images and videos. Visual news designed to grab viewers' attention is mainly posted on platforms such as social media and media sites. Facebook, Instagram and Twitter are popular social media platforms for posting and sharing online content, which then circulates to many other users; more than 70% of their users rely on them as daily news sources for the latest and quickest updates. Media sites are operated by content media companies; their content focuses on a wide range of visuals, and their sites are designed around style and user interest.[5]
Linguistic-based fake news takes the form of text or string content and is generally analysed by text linguistics. Its content focuses largely on text as a communication system and includes characteristics such as tone, grammar and pragmatics, which allow discourse analysis. Examples of linguistic-based platforms are blog sites, emails and news sites. Blog sites are managed by users, and their unsupervised content makes it easy to spread wrong information. Email is another medium through which users receive news, and its authenticity is challenging to detect and validate; hoaxes, spam and junk mail are infamously spread through email. Popular news websites can also generate their own content and attract users with their apparent authenticity.[5]
Characteristics of fake news (cues) are extracted from the source, headline, body text, visual content and social engagement of authors.
The 'bag of words' approach evaluates individual words (or n-grams) as single, significant units. The frequency of each word or n-gram is obtained, and the frequencies are aggregated and analysed for deceptive cues. The challenge of this approach is that it is language-dependent: it relies on individual n-grams, which are typically analysed in isolation from useful contextual information.[6]
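As a minimal sketch of this approach (assuming Python with scikit-learn; the example texts are invented), n-gram frequencies can be extracted and aggregated like so:

```python
# Minimal sketch of bag-of-words / n-gram cue extraction.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Shocking miracle cure doctors don't want you to know",
    "The city council approved the new transit budget on Tuesday",
]

# Count unigrams and bigrams; each n-gram is treated as an independent unit.
vectorizer = CountVectorizer(ngram_range=(1, 2), lowercase=True)
counts = vectorizer.fit_transform(docs)

# Aggregate frequencies across the corpus so they can be inspected for deceptive cues.
totals = counts.sum(axis=0).A1
for ngram, idx in sorted(vectorizer.vocabulary_.items()):
    print(f"{ngram!r}: {totals[idx]}")
```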
The LIWC (Linguistic Inquiry and Word Count) lexicon can be used to extract suitable proportions of words, which in turn aids the extraction of psycholinguistic features. This enables the system “to ascertain the tone of the language (e.g., positive emotions, perceptual process, etc.), statistics of the text (e.g., word counts) and part of speech category (e.g., articles, verbs)”. LIWC is a useful tool as it is able to “cluster single LIWC categories into multiple feature sets such as summary categories (e.g., analytical thinking, emotional tone), linguistic processes (e.g., function words, pronouns), and psychological processes (e.g., effective processes, social processes)”.[5]
The veracity of content can be assessed by analysing its readability. This involves extracting content features such as the number of characters, complex words, syllables and word types, among many others, which enables readability metrics such as Flesch-Kincaid, Flesch Reading Ease, Gunning Fog, and the Automated Readability Index (ARI) to be computed.[5]
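A rough illustration of one such metric, Flesch Reading Ease, computed from word, sentence and syllable counts (the syllable counter here is a crude vowel-group heuristic, not the standard algorithm):

```python
import re

def count_syllables(word: str) -> int:
    # Very rough heuristic: count groups of vowels as syllables.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_reading_ease(text: str) -> float:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    # Flesch Reading Ease: higher scores indicate easier-to-read text.
    return 206.835 - 1.015 * (len(words) / len(sentences)) - 84.6 * (syllables / len(words))

print(flesch_reading_ease("The quick brown fox jumps over the lazy dog. It runs fast."))
```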
Using discourse analysis, the truthfulness of an article's content can be evaluated. The Rhetorical Structure Theory (RST) analytic framework can be used to pinpoint rhetorical relations between linguistic components. Differences between honest and dishonest content in terms of coherence and structure can then be evaluated using a Vector Space Model (VSM): an individual item's position in a multi-dimensional RST space is assessed with respect to its distance from truth and deception, and conspicuous use of specific rhetorical relations may suggest deception. However, although there are tools that automatically classify rhetorical relations, they have yet to be formally adopted as an assessment tool for veracity.[6]
Deeper language structures, also known as syntax, are analysed to detect deception. “Features based on context-free grammar (CFG) are picked out and these features depend largely on lexicalised production rules that are combined with their parent & grandparent nodes”. The challenge is that syntax analysis on its own may not be the best at detecting deception, so it is usually combined with other linguistic or network analysis methods.[6]
Veracity of content can also be assessed by analyzing the compatibility between the content and the profile it comes from. This approach extends the n-gram and syntax analysis approaches. Firstly, deception can be identified through contradictions with, or omissions of, facts present in the user's previous posts on similar topics; for example, a truthful product review will most likely comment on the same product features that most other reviewers mention. Secondly, deception can be detected from content extracted via keywords as attribute:descriptor pairs. Profiles and descriptions of the author's experiences are matched, and the veracity of the described content is evaluated through compatibility scores: the content's compatibility with the existence of a distinct aspect and with a general aspect of what it actually describes. This approach predicts falsehood with approximately 91% accuracy. It has been shown to be valuable for reviews, but so far has only been effective in that domain. The challenge lies in determining the alignment of attribute:descriptor pairs, because it depends on the amount of profile content and on how accurately attributes are associated with descriptors.[6]
Visual-based cues are prevalent in all types of news content. The veracity of visual elements such as images and videos is assessed using visual features like clarity, coherence, diversity, clustering score and similarity distribution histogram, as well as statistical features such as count, image ratio, multi-image ratio, hot image ratio and long image ratio.[7]
Linked data approach
The linked data approach uses an existing collection of human knowledge to assess the veracity of new statements. It relies on querying available knowledge networks and publicly structured data such as the DBpedia Ontology or the Google Relation Extraction Corpus (GREC). The closer the node representing the new statement is to the nodes representing existing factual statements, the more likely the new statement is true. The challenge is that the relevant facts must already exist in a pre-existing knowledge bank.[6]
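A toy sketch of this idea, using a hypothetical miniature knowledge graph rather than DBpedia or GREC, scores a statement by how close its subject and object lie in the graph:

```python
# Toy illustration of the linked-data idea: statements whose subject and object
# are close in an existing knowledge graph are treated as more plausible.
# The graph content here is a made-up miniature, not a real knowledge base.
import networkx as nx

kg = nx.Graph()
kg.add_edges_from([
    ("Barack Obama", "United States"),
    ("United States", "Washington, D.C."),
    ("Barack Obama", "Harvard Law School"),
    ("Harvard Law School", "Cambridge, Massachusetts"),
])

def plausibility(subject: str, obj: str) -> float:
    # Shorter path between the two entities -> higher plausibility score.
    try:
        distance = nx.shortest_path_length(kg, subject, obj)
    except (nx.NetworkXNoPath, nx.NodeNotFound):
        return 0.0
    return 1.0 / (1.0 + distance)

print(plausibility("Barack Obama", "Washington, D.C."))        # close -> higher score
print(plausibility("Barack Obama", "Cambridge, Massachusetts"))  # farther -> lower score
```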
Sentiment refers to an unintended judgement or affective state. Syntactic patterns of content can be evaluated to separate emotions from factual arguments by analysing patterns of argumentation style classes. For example, fake negative reviewers have been found to use excessive negative emotion terms compared with honest reviewers, because they try to exaggerate the sentiment they are expressing.[6]
Social context features can be extracted from users' social engagements on social media platforms. They reveal the proliferation process, providing auxiliary information that suggests veracity. Social context features can be evaluated in three aspects: user-based, post-based and network-based.[7]
User-based
It has been suggested that fake news is more likely to be created and spread by social bots or cyborgs. By analyzing users' interactions with news on social media, user-based social context features can be identified and characterized.
Individual-level features infer the credibility and reliability of each user. Information such as registration age, follower/following counts and number of authored tweets is extracted.
Group-level features capture the overall characteristics of groups of users related to the news. Spreaders of news may form communities with particular characteristics; information such as the percentage of verified users and followers is used.
Post-based
Emotions and opinions towards fake news can be analysed through social media posts. Post-based features can be used to identify fake news via the reactions expressed in posts.
Post-level features are linguistic-based features applied to identify distinguishing characteristics of each post. Special features include stance, topic and credibility: stance reveals the user's opinion towards the news, topic is extracted using topic models such as latent Dirichlet allocation (LDA), and credibility assesses the degree of reliability.
Group-level features aggregate the feature values of all relevant posts for a news article using crowd wisdom.
Temporal-level features monitor the temporal variation of post-level feature values. Unsupervised embedding methods, such as recurrent neural networks (RNNs), are used to monitor changes in posts over time.
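As a small illustration of extracting post-level topic features with LDA, as mentioned above (assuming scikit-learn; the post texts are invented):

```python
# Sketch of post-level topic feature extraction with latent Dirichlet allocation.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

posts = [
    "vaccine hoax conspiracy exposed",
    "election results certified by officials",
    "miracle diet melts fat overnight",
    "court upholds election certification",
]

counts = CountVectorizer().fit_transform(posts)
lda = LatentDirichletAllocation(n_components=2, random_state=0)

# Each row is a post's topic distribution, usable as a post-level feature vector.
topic_features = lda.fit_transform(counts)
print(topic_features)
```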
Social network approach
Users create networks based on their interests, topics and relations. Because fake news spreads in echo-chamber cycles, network-based features that represent these network patterns are valuable for fake news detection. Network-based features are extracted by building specific networks among the users who authored related social media posts.
In the case of Twitter, the stance network is built with nodes representing tweets related to the news and edges representing the similarity of stances. The co-occurrence network depends on user engagements, counting users' authored posts related to the same news articles.
The friendship network shows the follower/followee structure among users who posted related tweets. An extension of the friendship network is the diffusion network, which tracks the trajectory of the spread of news: nodes represent users, and edges represent the diffusion paths of information among them. A diffusion path between two users exists only if the second user follows the first and posts about the news after the first user does.
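A simplified sketch of constructing such a diffusion network (the users, follow relations and posting times are invented; an edge is added from a followee to a follower who posted the story later):

```python
# Sketch of a diffusion network: nodes are users, a directed edge u -> v means
# v follows u and posted the story after u did. All data here is illustrative.
import networkx as nx

follows = {("bob", "alice"), ("carol", "bob"), ("carol", "alice")}  # (follower, followee)
post_time = {"alice": 1, "bob": 3, "carol": 7}  # when each user posted the story

diffusion = nx.DiGraph()
diffusion.add_nodes_from(post_time)
for follower, followee in follows:
    # Information can flow from followee to follower only if the follower posted later.
    if post_time[follower] > post_time[followee]:
        diffusion.add_edge(followee, follower)

print(sorted(diffusion.edges()))  # [('alice', 'bob'), ('alice', 'carol'), ('bob', 'carol')]
```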
Deep syntax can be analysed using probabilistic context-free grammars (PCFGs). Syntax structures are described by converting sentences into parse trees: nouns, verbs and so on are rewritten into their syntactic constituent parts, and probabilities are assigned to the parse tree. This method identifies rule categories such as lexicalised production rules and parent nodes, and detects deception with 85-91% accuracy, depending on the category used in the analysis.[8]
A model has been proposed to detect fake news on social media by classifying the propagation paths of news. The propagation path of each news story is modelled as a multivariate time series in which each tuple indicates the characteristics of a user who participated in the propagation of the news. A time series classifier built from recurrent and convolutional networks predicts the veracity of the news story; the recurrent and convolutional networks learn global and local variations of user characteristics, which in turn provide clues for detecting fake news.[9]

Clustering-based methods can also be used to detect fake news, with a reported success rate of 63% in classifying fake versus real news. A large amount of data is fed to an algorithm that creates a small number of clusters via agglomerative clustering with the k-nearest-neighbour approach. This approach "clusters similar news reports based on the normalized frequency of relations"; after the real and fake news cluster centres are computed, the model ascertains the deceptive value of a new article from its Euclidean distances to the real and fake cluster centres. The challenge of this approach is that it may be less accurate for fake news articles that are relatively new, because sets of similar news stories may not yet be available.[6]
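The distance-to-centre step can be sketched as follows (the feature vectors are invented; it is assumed the centres have already been computed from clustered, labelled articles):

```python
# Minimal sketch: compute "real" and "fake" cluster centres from labelled feature
# vectors, then label a new article by which centre is closer in Euclidean distance.
import numpy as np

real_features = np.array([[0.1, 0.9], [0.2, 0.8], [0.15, 0.85]])
fake_features = np.array([[0.9, 0.2], [0.8, 0.1], [0.85, 0.15]])

real_center = real_features.mean(axis=0)
fake_center = fake_features.mean(axis=0)

def classify(article_features: np.ndarray) -> str:
    d_real = np.linalg.norm(article_features - real_center)
    d_fake = np.linalg.norm(article_features - fake_center)
    return "real" if d_real < d_fake else "fake"

print(classify(np.array([0.3, 0.7])))  # closer to the real-news centre
```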
The detection of fake news can also be achieved through predictive-modelling-based methods. One example is the logistic regression model, in which positive coefficients increase the probability of truth while negative coefficients increase the probability of deception. “Authors claimed that regression indicators like Disjunction, Purpose, Restatement, and Solutionhood point to truth, and the Condition regression indicator pointed to deception”.[5]
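A minimal sketch of such a logistic regression model (the feature counts and labels are invented; the rhetorical-relation features simply mirror the indicators named above):

```python
# Sketch of a logistic-regression veracity model: positive coefficients push a
# document towards "true", negative ones towards "deceptive".
import numpy as np
from sklearn.linear_model import LogisticRegression

feature_names = ["Disjunction", "Purpose", "Restatement", "Solutionhood", "Condition"]
X = np.array([
    [2, 1, 3, 1, 0],   # truthful articles
    [1, 2, 2, 2, 0],
    [0, 0, 1, 0, 3],   # deceptive articles
    [0, 1, 0, 0, 2],
])
y = np.array([1, 1, 0, 0])  # 1 = true, 0 = deceptive

model = LogisticRegression().fit(X, y)
for name, coef in zip(feature_names, model.coef_[0]):
    print(f"{name}: {coef:+.2f}")  # sign indicates whether the cue points to truth or deception
```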
Fact checking is a form of “knowledge-based study of fake news” which focuses on assessing the truthfulness of news. There are two types of fact checking, namely manual and automatic.[10]
Manual fact checking is done by humans, either experts or regular people.
Expert-based
This method relies on professionals in the fact-checking field, known as fact-checkers, to authenticate particular news content. It is typically done by a small number of highly reliable fact-checkers, is relatively simple to conduct and is very accurate. Its disadvantages are that it is expensive and that the system is likely to be overwhelmed as the amount of news content to be verified increases.
Crowd-sourced
This alternative type of fact checking relies on a large group of ordinary individuals who act as fact-checkers. It is not as easy to conduct, and the results are likely to be less reliable and accurate owing to the fact-checkers' biases and possible conflicts between their annotations of the news content. However, compared with expert-based fact checking, a crowdsourced system is less likely to be overwhelmed as the volume of news content to be authenticated increases.
In this type of fact checking, it is important to filter out unreliable users and reconcile results that contradict each other; these concerns become more pressing as the fact-checking population expands.
Nevertheless, individuals who fact-check on these crowd-sourcing sites can provide more comprehensive feedback, such as including their attitudes or opinions.
A major problem with manual fact-checking is that the systems are easily overwhelmed by the growing volume of fresh news content to be checked, which is especially prevalent on social media. Automatic fact-checking methods have been developed to address this problem. These approaches mostly depend on “Information Retrieval (IR) and Natural Language Processing (NLP) techniques, as well as on network/graph theory”. Automatic fact checking generally comprises two steps: fact extraction and fact checking. In fact extraction, also known as knowledge-base construction, knowledge is taken from the Web as “raw facts”, which are typically redundant, outdated, conflicting, inaccurate or incomplete; these are then refined and cleaned up by “knowledge processing tasks to build a knowledge-base or a knowledge graph”. In fact checking, also known as knowledge comparison, the veracity of the news content is assessed by matching the knowledge extracted from the to-be-checked content against the facts in the existing “knowledge-base(s) or knowledge graph(s)”.
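A toy sketch of this two-step pipeline, with a hand-written triple store standing in for a real knowledge graph (all facts and claims are illustrative):

```python
# Step 1 (fact extraction) is assumed to have produced subject-predicate-object
# triples; step 2 (fact checking) compares a claim triple against the knowledge base.
knowledge_base = {
    ("Paris", "capital_of", "France"),
    ("Canberra", "capital_of", "Australia"),
}

def check_claim(triple: tuple) -> str:
    subject, predicate, obj = triple
    if triple in knowledge_base:
        return "supported"
    # Same subject and predicate but a different object contradicts the knowledge base.
    if any(s == subject and p == predicate and o != obj for s, p, o in knowledge_base):
        return "contradicted"
    return "unknown"

print(check_claim(("Paris", "capital_of", "France")))      # supported
print(check_claim(("Sydney", "capital_of", "Australia")))  # unknown (not in the base)
print(check_claim(("Paris", "capital_of", "Italy")))       # contradicted
```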
Deception-detection strategies fall under the “style-based study of fake news” and aim mainly to identify fake news by its style. A popular strategy for style-based deception detection uses “a feature vector representing the content style of the given information within a machine learning framework” to determine whether the information is deceitful (a classification task) or how deceitful it is (a regression task).[10]
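A minimal sketch of this framework, using TF-IDF as a stand-in content-style feature vector and a linear SVM as the classifier (the texts and labels are invented):

```python
# Sketch of the style-based framework: represent each document by a style feature
# vector and train a classifier to label deception.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = [
    "You won't BELIEVE this one weird secret!!!",
    "Officials confirmed the bridge will reopen next month.",
    "SHOCKING proof they are hiding the truth from you!!!",
    "The committee published its quarterly report today.",
]
labels = [1, 0, 1, 0]  # 1 = deceptive style, 0 = non-deceptive

classifier = make_pipeline(TfidfVectorizer(), LinearSVC())
classifier.fit(texts, labels)
print(classifier.predict(["Secret trick the experts don't want you to know"]))
```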
Propagation-based detection analyses the dissemination of fake news. There are two types: cascade-based and network-based fake news detection.[10]
A tree or tree-like structure is often used to represent a fake news cascade, which shows the propagation of fake news through users on a social network. The root node represents the user who publishes the fake news; the remaining nodes represent users who subsequently disseminate the news by forwarding or re-posting it. The cascade can be represented either by the number of steps the fake news has travelled (a hops-based fake news cascade) or by the times at which it was posted (a time-based fake news cascade). A hops-based cascade is often represented as a standard tree with parameters such as depth (the maximum number of hops taken), breadth (the number of users who received the fake news after it was posted) and size (the total number of users in the cascade). A time-based cascade is often represented by a tree-like structure with parameters such as lifetime (the longest interval over which the fake news propagates), real-time heat (the number of users forwarding or re-posting the fake news at time t) and overall heat (the overall number of users who forwarded or re-posted the fake news).
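The hops-based parameters can be computed on a toy cascade as follows (the users and edges are invented):

```python
# Sketch of hop-based cascade parameters (depth, breadth, size) on a toy cascade tree.
import networkx as nx

cascade = nx.DiGraph()
cascade.add_edges_from([
    ("publisher", "user1"), ("publisher", "user2"),
    ("user1", "user3"), ("user3", "user4"),
])

lengths = nx.shortest_path_length(cascade, source="publisher")
depth = max(lengths.values())                        # maximum number of hops taken
breadth = sum(1 for d in lengths.values() if d > 0)  # users reached after the original post
size = cascade.number_of_nodes()                     # everyone in the cascade, publisher included

print(depth, breadth, size)  # 3 4 5
```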
Utilizing graph kernels to analyze cascade-similarity
The similarity between the news cascades can be computed by using graph kernels and used within a supervised learning framework as a feature to detect fake news.
A graph-kernel-based hybrid support vector machine (SVM) classifier has been suggested that captures high-order propagation patterns (i.e. cascade similarities), in addition to features such as topics and sentiments. User roles (opinion leader or normal user), approval, sentiment and doubt scores are evaluated as well.
Assuming that cascades of fake news differ from cascades of real news, a random walk (RW) graph kernel k_RW(·, ·) has been used to detect fake news by computing the difference in distance between the two cascades.
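A rough sketch of a geometric random-walk kernel on two tiny unlabeled cascades, computed via the direct-product (Kronecker) adjacency matrix (the cascades and the decay parameter are illustrative, not the cited work's exact formulation):

```python
# Geometric random-walk graph kernel k_RW(G1, G2) for small unlabeled graphs.
import numpy as np

def random_walk_kernel(A1: np.ndarray, A2: np.ndarray, lam: float = 0.01) -> float:
    # Adjacency matrix of the direct product graph.
    Ax = np.kron(A1, A2)
    n = Ax.shape[0]
    # Geometric series over common walks: sum_k lam^k * Ax^k = (I - lam*Ax)^-1
    resolvent = np.linalg.inv(np.eye(n) - lam * Ax)
    return float(np.ones(n) @ resolvent @ np.ones(n))

# Two tiny cascades: a 3-node chain and a 3-node star.
chain = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
star = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)

print(random_walk_kernel(chain, chain))  # self-similarity
print(random_walk_kernel(chain, star))   # cross-similarity between different cascades
```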
Utilizing cascade representations
Informative representations of cascades can be used as features within a supervised learning framework. Instead of feature engineering, which is not automatic, representation learning, often achieved by deep learning, can be used to represent a cascade. Deep learning with a tree-structured recursive neural network (RNN) built according to the fake news cascade can automatically represent the news to be verified. However, because the depth of the cascade determines the depth of the neural network, this becomes challenging, as deep learning methods are sensitive to network depth.
Network-based fake news detection constructs flexible networks to capture the propagation of fake news indirectly. The networks can be homogeneous, heterogeneous or hierarchical.
Homogeneous network
Homogeneous networks contain a single type of node and a single type of edge. The stance network is a classic homogeneous network in which nodes represent users' news-related posts and edges represent the positive or negative relations between posts; it is used to evaluate the veracity of news-related posts.
Heterogeneous network
Heterogeneous networks consist of nodes and edges of multiple types. They are typically hybrid frameworks made up of three components: entity representation and embedding, relation modelling, and semi-supervised learning. One example is the tri-relationship network between news publishers, news articles and news proliferators.
Hierarchical network
Hierarchical networks consist of nodes and edges of various types that form a set-subset relationship (i.e. a hierarchy). In this network, news verification is turned into a graph optimization problem.
This approach assesses fake news “based on news-related and social-related information. For instance, intuitively, a news article published on unreliable website(s) and forwarded by unreliable user(s) is more likely to be fake news than news posted by authoritative and credible users”. In other words, it focuses on the source of the news content. As such, the credibility perspective on fake news largely overlaps with the propagation-based study of fake news.[10]
This method typically revolves around identifying clickbait: headlines that aim to capture users' attention and lead them to click through to a web page. Existing clickbait-detection studies use both “linguistic features such as term frequencies, readability, and forward references and non-linguistic features such as webpage links”,[11] “user interests” and “headline stance” “within a supervised learning framework such as gradient boosted decision trees” “to identify or block clickbaits”. Empirical studies suggest that clickbait is typically characterised by “a cardinal number, easy readability, strong nouns and adjectives to convey authority and sensationalism”.
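A small sketch of this setup using gradient boosted decision trees over a few of the headline features mentioned above (the headlines, feature choices and labels are invented):

```python
# Sketch of clickbait detection with gradient boosted decision trees over simple
# headline features: cardinal numbers, a readability proxy and sensational terms.
import re
from sklearn.ensemble import GradientBoostingClassifier

def headline_features(headline: str) -> list:
    words = headline.split()
    return [
        1 if re.search(r"\b\d+\b", headline) else 0,               # contains a cardinal number
        sum(len(w) for w in words) / max(1, len(words)),            # average word length (readability proxy)
        sum(w.lower() in {"shocking", "unbelievable", "amazing"} for w in words),  # sensational terms
    ]

headlines = [
    "10 shocking facts about sleep you won't believe",
    "City council approves annual budget",
    "7 amazing tricks to save money instantly",
    "Researchers publish study on bee populations",
]
labels = [1, 0, 1, 0]  # 1 = clickbait

model = GradientBoostingClassifier(random_state=0)
model.fit([headline_features(h) for h in headlines], labels)
print(model.predict([headline_features("5 unbelievable secrets doctors hide")]))
```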
This approach generally looks at the “quality, credibility, and political bias of source websites” in order to assess the quality and reliability of news content.
The credibility of news content can also be evaluated via the credibility of the comments associated with it. “User comments on news websites and social media carry invaluable information on stances and opinions”, although they are often overlooked. Models for assessing comment credibility can be classified into three types: content-based, behavior-based and graph (network)-based.
Content-based models
These models evaluate comment credibility by leveraging language features extracted from user comments; the strategy is comparable to that of style-based fake news detection.
Behavior-based models
These models often make use of the “indicative features of unreliable comments extracted from the metadata associated with user behavior”.
Looking at review spam detection studies, these related behavioral attributes can be sorted into five categories, namely, burstiness, activity, timeliness, similarity, and extremity.
Graph-based models
Lastly, these models focus on the relationships among reviewers, comments, products and so on. To evaluate the reliability of news comments, graph-based models frequently use “Probabilistic Graphical Models (PGMs), web ranking algorithms and centrality measures, or matrix decomposition techniques”.
Lastly, the credibility of news content can also be evaluated by assessing the reliability of the users who spread it. Users are a vital part of the propagation of deceptive news, spreading it by sharing, forwarding, liking and reviewing. Users can be categorized into two types: malicious users, who typically have low reliability, and normal users, who generally have higher reliability.

Malicious users deliberately spread deceptive news in search of monetary and/or non-monetary benefits such as power and popularity. They can be split into three categories. The first is bots, software applications “that run automated tasks or scripts over the Internet”. The second is trolls, people who bicker with or agitate other users with the aim of distracting them and ruining relationships, generally by posting provocative, digressive or irrelevant messages intended to provoke strong emotional responses. The third is cyborgs, accounts registered by humans as a cover for running “automated programs performing online activities”.

In contrast, naïve users are regular users who inadvertently join in the spreading of deceptive news because they mistake it for the truth. Two main factors have been studied to explain why naïve users participate in spreading fake news. The first is social influence, which “refers to environmental and exogenous factors such as network structure or peer pressure that can influence the dynamics of fake news”; this is demonstrated by “the bandwagon effect, normative influence theory and social identity theory”, which illustrate that “peer pressure psychologically impacts user behavior towards fake-news-related activities”. The second factor is self-influence, the intrinsic characteristics of users that affect how they react to and handle deceptive news. For instance, according to confirmation bias and naïve realism, users are more likely to believe deceptive news, or participate in its related activities, if it validates their pre-existing knowledge.
The credibility of Twitter events has been assessed by creating a data set of tweets relevant to trending topics and annotating, via crowdsourcing, the veracity of each tweet. Four feature groups, namely message, user, topic and propagation, were analysed using a decision tree model, achieving 86% accuracy. Benevuto et al.[citation needed] built a model that detects spammers by constructing a manually annotated dataset of 1,000 spam and non-spam accounts; attributes of content and user behaviour were extracted and analysed, and the method detected 70% of spam accounts and 96% of non-spam accounts. Chu et al.[citation needed] developed a similar detection model that distinguishes bot accounts, categorising accounts into three groups: humans, bots and cyborgs. Their system analyses four feature groups, namely entropy measures, spam detection, account properties and decision making, and identified the ‘human’ class with 96% accuracy.[12]
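A minimal sketch of such a decision tree over the four feature groups (the feature values and labels are invented):

```python
# Sketch of a decision-tree credibility classifier: each tweet/event is represented
# by message, user, topic and propagation features and labelled as credible or not.
from sklearn.tree import DecisionTreeClassifier

# Columns: [message length, user follower count, topic is trending (0/1), retweet count]
X = [
    [120, 15000, 1, 900],
    [45, 20, 1, 3],
    [200, 8000, 0, 400],
    [30, 12, 0, 1],
]
y = [1, 0, 1, 0]  # 1 = credible, 0 = not credible

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(tree.predict([[80, 5000, 1, 250]]))
```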
Browser plugins can detect deceptive content such as clickbait, bias, conspiracy theories and junk science on social media websites. One example is the ‘Fake News Detector’, which uses machine learning techniques to gather a ground-truth data set and applies crowd wisdom to improve the program and allow it to learn. Another browser add-on was created by four college students during a hackathon at Princeton University; it analyses the user's feed in real time and warns the user before posting or sharing potentially false content, based on keywords, images and sources.[12]
Fake news itself is not new. However, as technology evolves and social media continues to dominate everyday life, detecting fake news becomes more challenging because the speed at which it travels keeps accelerating.[13] A study published in the journal Science analysed millions of tweets sent between 2006 and 2017 and found that “Falsehood diffused significantly farther, faster, deeper, and more broadly than the truth in all categories of information”, and that “it took the truth about six times as long as falsehood to reach 1,500 people”. Beyond the sheer speed at which fake news travels, it is also harder to detect because of how attractively most fake news articles are titled; the same Science paper found that replies to false-news tweets contained more expressions of surprise or disgust than replies to true news.[14]
Because linguistic cues vary, a new cue set must be designed for each prospective situation, which makes it difficult to generalize cue- and feature-engineering methods across different topics and domains. Such approaches therefore require more human involvement in the design, evaluation and utilization of these cues for detection.[15]
Although this class of methods is often considered better than cue-based methods, it still does not extract and fully exploit the rich semantic and syntactic information in the content. For example, the n-gram approach is simple but cannot model more complicated contextual dependencies of the text; syntactic features used alone are less powerful than word-based n-grams, and a superficial combination of the two is not effective at capturing their complex interdependence.[15]
Fake news detection remains a challenge even for deep learning methods such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs), because the content of fake news is planned to resemble the truth and deceive readers; without cross-referencing and fact checking, it is often difficult to determine veracity by text analysis alone.[15]
The issue with existing feedback-based methods (e.g. response user analysis, response text analysis, temporal pattern analysis, propagation pattern analysis and hand-engineered analysis) is the type of training data the models are trained on: usually a snapshot of users' responses collected after, or towards the end of, the propagation process, once sufficient responses are available. As a result, models trained this way perform worse at early detection, when few responses have been collected. These methods also cannot update their state as users' responses arrive incrementally.[15]
Intervention-based methods (decontamination, network monitoring, crowdsourcing and user-behaviour modelling) tend to be more difficult to evaluate and test, especially in complex environments with many interdependent connections and transactions. They may also make restrictive assumptions about certain cases, which limits their applicability.[15]