Data Visualization

This section details how the team visualized the data after the data preprocessing phase.

Outline

  1. Frequency: Vocabulary
  2. Frequency: Rational vs. Emotional
  3. Frequency: Tweet Ratings
  4. Comparison: Annual Tweet Frequency
  5. Comparison: Account Age
  6. Comparison: Account Age vs. Date Posted
  7. Correlation

Frequency: Vocabulary

One of the most important factors to look at when analyzing misinformation tweets is which words and terms appear most frequently in the data. Identifying this vocabulary allows us to better understand the main drivers behind why people publish misinformation.

	import nltk
	import pandas as pd
	from nltk import word_tokenize
	from nltk.probability import FreqDist
	from matplotlib import pyplot as plt
	
	# Join the lemmatized tweets into one string and tokenize it
	token_temp = tweets_lemmatized.copy()
	words = " ".join(token_temp)
	tokens = word_tokenize(words)
	
	# Display the total number of words present
	# (len(words) would count characters, not words)
	print(f"The total number of words in the text is {len(tokens)}")
	
	# Find the frequency of words
	fdist = FreqDist(tokens)
	
	# Create a dataframe of the 20 most frequent words, sorted by count
	df_freq = pd.DataFrame(fdist.most_common(20), columns=["words", "count"])
	
	# Plot the most frequent words
	fdist.plot(20)
	plt.show()
 
Using a line graph, we see that the word 'dengvaxia' is the most mentioned term, followed by 'child' and 'vaccine'.
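To make the counting step concrete, here is a minimal sketch on a hypothetical list of lemmatized tweets (`sample_tweets` is invented for illustration and is not the project data). It uses `collections.Counter` and whitespace splitting as stand-ins for `FreqDist` and `word_tokenize`; `FreqDist` is itself a `Counter` subclass, so `most_common` works the same way on both.

```python
from collections import Counter

# Hypothetical lemmatized tweets, for illustration only.
sample_tweets = [
    "dengvaxia vaccine child",
    "child dengvaxia",
    "dengvaxia kid",
]

# Split on whitespace (a stand-in for word_tokenize) and count each token.
tokens = " ".join(sample_tweets).split()
freq = Counter(tokens)

print(len(tokens))          # total number of tokens, not characters
print(freq.most_common(2))  # the two most frequent tokens with their counts
```

Counting tokens rather than the length of the joined string is what gives a true word count.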

Plotting the word frequency as a donut chart will allow us to better visualize the distribution of different words across all the tweets.

	import plotly.express as px
	import seaborn as sns
	
	# Build a reversed red palette with one color per slice
	n = 30
	pal = list(sns.color_palette(palette="Reds_r", n_colors=n).as_hex())
	
	# df_freq holds the most frequent words, sorted by count
	fig = px.pie(df_freq[0:n], values="count", names="words", color_discrete_sequence=pal)
	
	# hole=0.6 turns the pie into a donut chart
	fig.update_traces(
		textposition="outside",
		textinfo="percent+label",
		hole=0.6,
		hoverinfo="label+percent+name",
	)
	
	# Plot frequency of words as a donut chart
	fig.update_layout(width=800, height=600, margin=dict(t=0, l=0, r=0, b=0))
	fig.show()
	
 
Now, we can easily observe that the four most frequently mentioned words make up more than 50% of the word frequency list across all tweets. These are 'dengvaxia', 'child', 'vaccine', and 'kid'.
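The "more than 50%" reading can also be checked numerically instead of by eye. The sketch below computes the cumulative share covered by the top-k words; the counts in `word_counts` are hypothetical and chosen only to keep the arithmetic easy to follow. Note that the percentages in a pie chart are shares of the plotted rows only, so a check like this should run on the same rows that were plotted.

```python
from itertools import accumulate

# Hypothetical (word, count) pairs sorted by count, for illustration only.
word_counts = [("dengvaxia", 40), ("child", 25), ("vaccine", 20), ("kid", 15)]

total = sum(count for _, count in word_counts)
running = accumulate(count for _, count in word_counts)

# Cumulative percentage of the total covered by the top-k words.
cumulative_share = [
    (word, round(100 * subtotal / total, 1))
    for (word, _), subtotal in zip(word_counts, running)
]
print(cumulative_share)
```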

Frequency: Rational vs. Emotional


	# Bar Graph
	import seaborn as sns
	
	df = dengvaxia_nlp.copy()
	
	# Plots 'Content type' frequency in a bar graph
	sns.countplot(data=raw_data, x="Content type")
 
The Dengvaxia controversy was a highly sensitive and polarizing issue at the time because it involved the deaths of several children nationwide. Thus, it was expected that the tweets gathered would be emotional in nature, with people often voicing their views toward government officials and expressing pity for the alleged victims of the authorities' negligence.

However, we also saw a significant proportion of tweets that use scientific terms and reasoning, albeit often misleadingly, to try to lend validity to their opinions. This is reflected in the 'Content type' frequency graph above.

Frequency: Tweet Ratings


	# Plots 'Rating' frequency in a bar graph
	sns.countplot(data=raw_data, x="Rating")
 
An overwhelming majority of the tweets gathered contain false information regarding Dengvaxia vaccines.

This might be because people were quick to believe that there was a correlation between Dengvaxia and the deaths of the children when, in fact, no definite proof has been published yet that would confirm these allegations. The tweets that insist that the Dengvaxia vaccine was the cause of these reported deaths contribute to false information, not because they mislead other people or intentionally leave out information that would provide context to their claims, but simply because they present false and unproven information.

Comparison: Annual Tweet Frequency


	# Line Graph
	import matplotlib.pyplot as plt
	
	# Count the tweets posted in each year from 2016 to 2022
	x_tweetfreq = list(range(2016, 2023))
	y_tweetfreq = []
	for year in x_tweetfreq:
		in_year = (df["Date posted"] >= str(year)) & (df["Date posted"] < str(year + 1))
		y_tweetfreq.append(int(in_year.sum()))
	
	plt.plot(x_tweetfreq, y_tweetfreq, color="#76b5aa")
	plt.xlabel("Year")
	plt.ylabel("Number of Tweets")
	plt.title("Number of misinformation tweets each year")
	plt.show()
 
2018 saw a sudden rise in tweets with Dengvaxia misinformation.

This was the time when Dengvaxia vaccination cases became a nationwide scandal. The scandal began when the deaths of several children were allegedly linked to Dengvaxia vaccines, and blame was put on the pharmaceutical firms and government officials who initiated the Dengvaxia vaccination program. Hence, 2018 was the year when the issue stayed at the top of the headlines for many months.
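As an alternative to filtering year by year, the same per-year counts can be sketched with pandas datetime accessors. This assumes the 'Date posted' column can be parsed by `pd.to_datetime`; the `sample` frame below is hypothetical and stands in for the project data.

```python
import pandas as pd

# Hypothetical posting dates, for illustration only.
sample = pd.DataFrame(
    {"Date posted": ["2017-12-04", "2018-02-11", "2018-07-30", "2019-01-15"]}
)

# Parse the dates, extract the year, and count rows per year.
years = pd.to_datetime(sample["Date posted"]).dt.year
tweets_per_year = (
    years.value_counts()
    .reindex(range(2016, 2023), fill_value=0)  # keep years with no tweets
)
print(tweets_per_year)
```

Reindexing over the full year range keeps years without any tweets in the result, so the line graph does not silently skip them.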

Comparison: Account Age


	# Bar Graph
	import matplotlib.pyplot as plt
	
	def accounts_joined_per_year(years):
		# Count the rows whose 'Joined' date falls within each year
		return [
			int(((df["Joined"] >= str(y)) & (df["Joined"] < str(y + 1))).sum())
			for y in years
		]
	
	# Full range of join years, used for the bar graph below
	x_joinedfreq1 = list(range(2008, 2023))
	y_joinedfreq1 = accounts_joined_per_year(x_joinedfreq1)
	
	# Same range as the tweet frequency data, used in the next section
	x_joinedfreq2 = list(range(2016, 2023))
	y_joinedfreq2 = accounts_joined_per_year(x_joinedfreq2)
	
	# Plot yearly misinformation accounts created in a bar graph
	plt.bar(x_joinedfreq1, y_joinedfreq1, color="#B57681")
	plt.xlabel("Year")
	plt.ylabel("Number of Accounts")
	plt.title("Number of misinfo accounts created each year")
	plt.show()
 
Interestingly, 2018 was also the year in which the most accounts that posted misinfo tweets were created.

In the next section, we take a closer look by comparing these variables side by side in a single bar chart.

Comparison: Account Age vs. Date Posted


	import pandas as pd
	
	# Combine yearly account creation and tweet counts, indexed by year
	temp = pd.DataFrame(
		{"Joined": y_joinedfreq2, "Tweet Frequency": y_tweetfreq},
		index=x_tweetfreq,
	)
	
	# Plot account age vs. date posted in a bar graph
	temp.plot(kind="bar", color=["#B57681", "#76b5aa"])
	plt.title("Account Age vs. Date Posted")
	plt.xlabel("Year")
	plt.ylabel("Count")
	plt.show()
 
We can observe that 2018 was both the year from which most of the gathered tweets come and the year when the majority of accounts that engaged in misinfo tweets were created. Recall that this was the year when the Dengvaxia issue became a national scandal and a hot topic not just locally but also internationally.

Thus, it is not surprising that people who wanted to voice their opinions on this matter created their accounts that year. It is also possible that these newly created accounts were trolls who wanted to stay anonymous so they could spread fake information without facing consequences.
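One way to probe the "newly created account" hypothesis more directly is to measure how old each account was when it posted. The sketch below is an illustration under assumptions, not part of the original analysis: it assumes 'Joined' and 'Date posted' are parseable dates, uses a hypothetical `sample` frame, and picks an arbitrary 90-day cutoff for what counts as a new account.

```python
import pandas as pd

# Hypothetical join and posting dates, for illustration only.
sample = pd.DataFrame(
    {
        "Joined": ["2018-01-10", "2012-06-01", "2018-03-05"],
        "Date posted": ["2018-02-09", "2018-02-20", "2018-03-06"],
    }
)

# Account age in days at the moment each tweet was posted.
age_days = (
    pd.to_datetime(sample["Date posted"]) - pd.to_datetime(sample["Joined"])
).dt.days

# Share of tweets posted from accounts younger than 90 days (arbitrary cutoff).
share_new = (age_days < 90).mean()
print(age_days.tolist(), round(share_new, 2))
```

A high share of very young accounts would be consistent with the troll explanation, although it would not prove it.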

Correlation


	# Correlation plot (heat map)
	import plotly.express as px
	
	df = dengvaxia_nlp.copy()
	
	features = [
		"Likes",
		"Retweets",
		"Followers",
		"Following",
		"content-type_Emotional",
		"content-type_Rational",
		"tweet-type_Text",
		"tweet-type_Image",
		"tweet-type_Video",
		"tweet-type_URL",
		"tweet-type_Reply",
	]
	corr = df[features].corr()
	
	fig = px.imshow(
		corr,
		color_continuous_scale="RdBu",
		zmin=-1,
		zmax=1,
		labels=dict(x="Features", y="Features", color="Correlation"),
		x=corr.columns,
		y=corr.columns,
		title="Correlation Between Features",
	)
	fig.show()
	
 
In the correlation matrix, we can identify a number of features that are correlated. Recall that a correlation coefficient of:
  • -1 means that the 2 variables have a perfect inverse linear relationship: when X increases, Y decreases
  • 0 means no linear correlation between X and Y
  • 1 means that the 2 variables have a perfect positive linear relationship: when X increases, Y increases too

The emotional and rational content features are the most evident pair, as they have a correlation coefficient of -1. This is expected because every tweet is labeled as either emotional or rational, so the two indicator columns are mutually exclusive.
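The -1 between the two content-type columns follows directly from how indicator columns work. A minimal sketch, using hypothetical labels and `pd.get_dummies` as an assumed stand-in for however the project produced its one-hot columns:

```python
import pandas as pd

# Hypothetical labels, for illustration only.
content_type = pd.Series(
    ["Emotional", "Rational", "Emotional", "Emotional", "Rational"]
)

# One-hot encode: each row has exactly one of the two indicator columns set to 1.
dummies = pd.get_dummies(content_type, prefix="content-type", dtype=int)

# Rational equals 1 - Emotional on every row, so the Pearson correlation
# between the two columns is -1 (up to floating-point rounding).
r = dummies["content-type_Emotional"].corr(dummies["content-type_Rational"])
print(round(r, 6))
```

Because one column fully determines the other, only one of the two carries information, which is worth keeping in mind if these features are later used in a model.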

The likes and retweets features also have a high correlation of 0.69. This makes sense, as both describe the engagement a tweet receives. Thus, a tweet with more likes tends to have more retweets as well, and vice versa.

The following and followers counts have a modest correlation of 0.34. This number makes sense, as both represent the number of connections an account has on Twitter. Thus, as the number of accounts that a user follows increases, the number of its followers tends to increase as well, and vice versa. However, we have seen a number of cases where this does not hold. For example, celebrities, media personalities, and official accounts tend to have a large imbalance between these numbers, as they have a huge number of followers but follow only a few accounts.

Finally, the features tweet-type_URL and tweet-type_Reply have a correlation of -0.46. However, drawing a hypothesis about the relationship between these two categories is not meaningful, as whether a tweet contains a URL and whether it is a reply to another tweet are unrelated to each other.