Hi, we are Dengvaxia Scare

And this is our project, clustering the tweets that spread misinformation related to Dengvaxia.

In an effort to combat disinformation on social media, this project aims to categorize tweets based on their content and analyze which content type was the most prominent in spreading misinformation.

Data Science Team

  • Blueming Dan Moneda, CS132 WFV
  • Luis Miguel Senatin, CS132 WFV
  • Christian Karl Verdad, CS132 WFV
About the Team

Overview

Motivation

The Dengvaxia scandal that began in 2017 triggered an influx of misinformation across Twitter.

Problem

We wanted to find out which type of misinformation content was being spread around the most.

Solution

Use a clustering machine learning algorithm to categorize the tweets into different content types.

Problem Formulation

Research Problem

What content in tweets contributed the most to increasing the number of Dengvaxia misinformation tweets?

Hypothesis

Tweets with content about the deaths of children caused by the Dengvaxia vaccine contributed the most to the increasing number of Dengvaxia misinformation tweets.

Null Hypothesis

Every type of content in tweets equally contributed to the increasing number of Dengvaxia misinformation tweets.

Action Plan

Categorize Dengvaxia misinformation tweets by their content type, then analyze the frequencies and posting times of each category.

Data Collection

Twitter advanced search

Tweets were manually collected using Twitter's advanced search function. We searched for the keyword "Dengvaxia" and combed through tweets posted from 2017 to 2020.




Dataset

In total, we gathered 150 misinformation tweets about Dengvaxia.

Data Science Methodology

Results

After performing visualization, statistical modeling, and machine learning, these are some of the interesting results.

A few terms clearly dominate the vocabulary of the collected tweets: dengvaxia, child, vaccine, kid, and died together make up 56.39% of the word frequency across the 150 tweets.
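To make the word-frequency figure concrete, here is a minimal sketch of how such a share could be computed. The sample tweets, the stop-word list, and the helper names (`tokenize`, `term_share`) are all hypothetical. The project's actual preprocessing is not described here, so this is an illustration and not a reproduction of the 56.39% result.

```python
import re
from collections import Counter

# Hypothetical sample tweets; the 150 collected tweets are not reproduced here.
tweets = [
    "Dengvaxia vaccine killed a child",
    "Another kid died after the Dengvaxia vaccine",
    "Dengvaxia is dangerous",
]

# Assumed stop-word list, chosen only for this example.
STOPWORDS = {"a", "the", "is", "after", "another"}

# The five dominant terms named in the results.
KEY_TERMS = ["dengvaxia", "child", "vaccine", "kid", "died"]


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def term_share(tweets, terms):
    """Return the token counts and the fraction of all tokens covered by `terms`."""
    counts = Counter(tok for tweet in tweets for tok in tokenize(tweet))
    total = sum(counts.values())
    covered = sum(counts[t] for t in terms)
    return counts, covered / total


counts, share = term_share(tweets, KEY_TERMS)
print(counts.most_common(5))
print(f"Share of key terms: {share:.2%}")
```

The share depends heavily on tokenization and stop-word choices, so a different preprocessing pipeline would give a different percentage.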

A Kruskal-Wallis test found no significant differences between topics in any measure of engagement (likes, retweets, or replies). We therefore cannot support our hypothesis that "Tweets about children's deaths resulted in a higher average amount of engagement based on the tweet topic."
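The test above can be sketched as follows. This is a standard-library illustration of the tie-corrected Kruskal-Wallis H statistic, assuming exactly three topics so that the chi-square approximation has 2 degrees of freedom and the p-value reduces to exp(-H/2). The like counts and topic names are hypothetical, not the project's data. In practice a library routine such as `scipy.stats.kruskal` would normally be used instead.

```python
import math
from collections import Counter


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Return the tie-corrected Kruskal-Wallis H statistic for a list of groups."""
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    weighted = 0.0
    start = 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        weighted += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12 / (n * (n + 1)) * weighted - 3 * (n + 1)
    # Tie correction: divide by 1 - sum(t^3 - t) / (n^3 - n) over tied values.
    ties = sum(t ** 3 - t for t in Counter(pooled).values())
    correction = 1 - ties / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all values are identical; H is undefined")
    return h / correction


def p_value_three_groups(h):
    """Chi-square survival function with 2 degrees of freedom (three groups only)."""
    return math.exp(-h / 2)


# Hypothetical like counts per topic (not the project's data).
likes_by_topic = {
    "child_deaths": [0, 2, 5, 12],
    "vaccine_safety": [1, 2, 7, 9],
    "politics": [0, 3, 4, 15],
}

h_stat = kruskal_wallis_h(list(likes_by_topic.values()))
p_value = p_value_three_groups(h_stat)
print(f"H = {h_stat:.4f}, p = {p_value:.4f}")
```

A p-value above the chosen significance level (commonly 0.05) means the test fails to reject the hypothesis that the groups share the same distribution. The chi-square approximation is rough for very small groups like the ones in this example.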

After applying Latent Dirichlet Allocation (LDA) for topic clustering, we found that deaths among children still dominate the misinformation tweets about Dengvaxia, even though that topic shows no significant difference in engagement compared to other topics.
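For readers unfamiliar with LDA, the sketch below fits a small topic model with collapsed Gibbs sampling and then counts how many tweets have each topic as their dominant one. It is written from scratch for illustration only. The LDA implementation, number of topics, and hyperparameters the project used are not stated here, so the function `lda_gibbs`, its defaults, and the tokenized sample tweets are all assumptions.

```python
import random
from collections import Counter


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling.

    Returns per-document topic proportions, topic-word counts, and the vocabulary.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [[0] * n_words for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        doc_assignments = []
        for w in doc:
            z = rng.randrange(n_topics)
            doc_assignments.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(doc_assignments)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][wid] -= 1
                topic_total[z] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][wid] + beta)
                    / (topic_total[k] + n_words * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][wid] += 1
                topic_total[z] += 1

    theta = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in doc_topic[d]]
        for d, doc in enumerate(docs)
    ]
    return theta, topic_word, vocab


# Hypothetical tokenized tweets (not the project's data).
docs = [
    ["dengvaxia", "child", "died", "vaccine"],
    ["kid", "died", "dengvaxia"],
    ["child", "death", "vaccine"],
    ["government", "corruption", "dengvaxia"],
    ["senate", "hearing", "government"],
]

theta, topic_word, vocab = lda_gibbs(docs, n_topics=2)

# Count how many tweets have each topic as their dominant topic.
dominant = Counter(max(range(len(row)), key=row.__getitem__) for row in theta)
for topic, n_tweets in sorted(dominant.items()):
    top_ids = sorted(range(len(vocab)), key=lambda i: -topic_word[topic][i])[:3]
    print(topic, n_tweets, [vocab[i] for i in top_ids])
```

Topic indices carry no meaning on their own. A topic would be labeled, for example as "deaths among children", only after inspecting its top words.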