Reddit Analysis - EDA

Initializing Spark Session

Reading the entire dataset

Basic Info of the Data

Data Quality Check

As we can see, there are some columns where the percentage of missing values is almost 100. We can drop such columns directly, as they will not help us in the analysis.
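The drop rule described above can be sketched in plain Python. This is a minimal illustration, not the notebook's Spark code: the sample rows, the column names, and the 90% cutoff are all assumptions chosen for the example.

```python
def missing_percentages(rows, columns):
    """Return {column: percent of rows where the value is None or absent}."""
    total = len(rows)
    if total == 0:
        return {col: 0.0 for col in columns}
    return {
        col: 100.0 * sum(1 for row in rows if row.get(col) is None) / total
        for col in columns
    }


def drop_sparse_columns(rows, columns, threshold=90.0):
    """Keep only columns whose missing percentage is below `threshold`.

    The 90% default is an assumed cutoff, not a value from the notebook.
    """
    pct = missing_percentages(rows, columns)
    kept = [col for col in columns if pct[col] < threshold]
    return [{col: row.get(col) for col in kept} for row in rows], kept


# Hypothetical sample: `author_cakeday` is empty in every row.
sample_rows = [
    {"body": "hello", "score": 3, "author_cakeday": None},
    {"body": "world", "score": None, "author_cakeday": None},
    {"body": "again", "score": 7, "author_cakeday": None},
    {"body": "more", "score": 1, "author_cakeday": None},
]
sample_columns = ["body", "score", "author_cakeday"]
cleaned_rows, kept_columns = drop_sparse_columns(sample_rows, sample_columns)
```

Computing the percentages first and filtering second keeps the cutoff easy to adjust if a column turns out to be sparse but still useful.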

Data Transformations

Adding new column for the exact date of the comment
Adding new column for the content of the original post
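As a rough sketch of the first transformation only, the snippet below derives a calendar date from a timestamp. It assumes the comment time is stored as epoch seconds in a field called created_utc and that dates are taken in UTC. The field name, the output name comment_date, and the sample row are assumptions, and the notebook's own Spark code may differ.

```python
from datetime import datetime, timezone


def add_comment_date(row, ts_field="created_utc"):
    """Return a copy of `row` with a `comment_date` field (ISO date, UTC).

    `ts_field` is assumed to hold epoch seconds.
    """
    stamp = datetime.fromtimestamp(row[ts_field], tz=timezone.utc)
    new_row = dict(row)  # copy so the input row is not mutated
    new_row["comment_date"] = stamp.date().isoformat()
    return new_row


# Hypothetical comment row.
example = add_comment_date({"body": "stay safe", "created_utc": 1585699200})
```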

Graphs - Business Questions

Relationship between Comment Score and Length

Summary Statistic

The length (number of characters) and average score of the comments

For a few comment lengths, the average score is greater than 2000. Hence we can subset the dataset to leave these out and get a better understanding of the overall pattern.

The relationship between the two variables is very erratic. On average, the score is comparatively low when the comment is short, and the highest average scores are reached at greater lengths. An author aiming to increase their score could therefore target a character count of more than 3000.
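The summary behind this plot, the average score at each comment length, can be sketched in plain Python as follows. The sample comments are made up, and the cap of 2000 mirrors the subsetting step described above. This illustrates the computation and is not the notebook's Spark code.

```python
from collections import defaultdict


def average_score_by_length(comments):
    """Map each comment length (in characters) to the mean score at that length."""
    totals = defaultdict(lambda: [0, 0])  # length -> [score sum, comment count]
    for body, score in comments:
        bucket = totals[len(body)]
        bucket[0] += score
        bucket[1] += 1
    return {length: s / n for length, (s, n) in sorted(totals.items())}


# Hypothetical (body, score) pairs.
sample_comments = [("ok", 1), ("no", 3), ("hello", 10), ("a much longer comment", 40)]
length_to_avg = average_score_by_length(sample_comments)

# Subsetting step: keep only lengths whose average score is at most 2000.
capped = {length: avg for length, avg in length_to_avg.items() if avg <= 2000}
```

Lengths with very few comments give noisy averages, which may account for part of the erratic shape, so carrying the comment count alongside the mean can be worthwhile.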

Relationship between Time of the day and frequency of comments

Hottest comment time

This graph shows the number of comments at different times of the day. The timeframe with the highest activity (the hottest comment time) is between 3pm and 7pm, when Reddit users are most active. As the night progresses, the number of comments dips, and the lowest point is around 9-10am. This is the time when people generally wake up, so they are not very active on Reddit.

Having looked at the time of the day, we now take a broader view and look at the number of comments for each month in the timeframe of our data (2019/07 - 2021/06).
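A minimal sketch of the hour-of-day count is given below, assuming comment times are available as epoch seconds and that hours are read in UTC. The text does not say which timezone the hours refer to, so the UTC choice and the sample timestamps are assumptions.

```python
from collections import Counter
from datetime import datetime, timezone


def comments_per_hour(timestamps):
    """Count comments for each hour of the day (0-23, UTC)."""
    hours = (datetime.fromtimestamp(ts, tz=timezone.utc).hour for ts in timestamps)
    counts = Counter(hours)
    # Fill in zero for hours with no comments so all 24 hours are present.
    return {hour: counts.get(hour, 0) for hour in range(24)}


# Hypothetical epoch timestamps: two in the 15:00 hour, one in the 09:00 hour (UTC).
sample_ts = [1585753200, 1585755000, 1585731600]
hourly = comments_per_hour(sample_ts)
hottest_hour = max(hourly, key=hourly.get)
```

Because Reddit users are spread across timezones, the hour boundaries of the peak and the trough depend on which timezone the timestamps are converted to.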

Relationship between Time Period and frequency of comments

Summary Statistic

The count and time of the comments

At the beginning of the time period, the number of comments is consistent. There is a spike in the frequency of comments after March 2020, which is when Covid hit. This spike makes sense: people were uncertain and anxious about the situation and may have found Reddit, with its anonymity, a good platform to vent their emotions and share comments. The number of comments then decreases drastically. After October 2020 it follows a steady path again, though it remains higher than before Covid. This might be because the number of Reddit users had increased.

Top 10 authors with the highest frequency of comments

This graph shows the top 10 authors with the highest frequency of comments over our timeframe. The most active user is a-mirror-bot, with 66720 comments in total over the course of 2 years.
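The ranking itself amounts to a frequency count over the author column. A small sketch in plain Python follows. The sample author list is hypothetical apart from the a-mirror-bot name, and the notebook's Spark code may be written differently.

```python
from collections import Counter


def top_authors(authors, n=10):
    """Return the `n` most frequent authors as (author, comment_count) pairs."""
    return Counter(authors).most_common(n)


# Hypothetical author column; real data would have one entry per comment.
sample_authors = ["a-mirror-bot"] * 3 + ["user_b"] * 2 + ["user_c"]
ranking = top_authors(sample_authors, n=2)
```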

Checking if the Top 10 authors with the highest frequency of comments are controversial

Summary Statistic

The controversiality mean and sum of the top 10 users

The user with the highest number of comments is a-mirror-bot, but this graph shows that it is not the user with the highest controversiality. This suggests that being active or having more comments does not make a user more prone to being tagged as controversial. CantStopPopping is one of the least active users among the top 10 but still has the highest controversiality.
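The per-author sum and mean of controversiality can be sketched as below, assuming controversiality is a 0/1 flag on each comment. That encoding and the sample data (including the author names) are assumptions made for illustration.

```python
from collections import defaultdict


def controversiality_stats(comments, authors_of_interest):
    """Return {author: (sum, mean)} of the controversiality flag (assumed 0/1)."""
    flags = defaultdict(list)
    for author, flag in comments:
        if author in authors_of_interest:
            flags[author].append(flag)
    return {author: (sum(v), sum(v) / len(v)) for author, v in flags.items()}


# Hypothetical (author, controversiality) pairs.
sample = [
    ("busy_user", 0), ("busy_user", 0), ("busy_user", 1), ("busy_user", 0),
    ("quiet_user", 1), ("quiet_user", 1),
]
stats = controversiality_stats(sample, {"busy_user", "quiet_user"})
```

The mean is the share of an author's comments that were flagged, so it allows a fair comparison between authors with very different comment counts, while the sum favours the most active authors.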

Relationship between Score and Total Awards for each comment of A-Mirror-Bot (Most active author)

From the graph, it can be seen that even the most active user does not attract many awards. Even as the score of the comments increases, the number of awards does not increase and stays consistently at zero.

Creating Dummy Variables using Regex

Summary Statistics

The sum and mean of Controversiality and Score for comments with different words depicting pandemic freakout
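One way to build regex-based dummy variables is sketched below. The keyword groups and the dummy names are hypothetical, since the actual word list used for "pandemic freakout" is not shown here.

```python
import re

# Hypothetical keyword groups; the notebook's actual word list is not shown.
PATTERNS = {
    "mentions_covid": re.compile(r"\b(covid|coronavirus|pandemic)\b", re.IGNORECASE),
    "mentions_lockdown": re.compile(r"\b(lockdown|quarantine)\b", re.IGNORECASE),
}


def add_dummies(body):
    """Return {dummy_name: 0 or 1} depending on whether each pattern matches."""
    return {name: int(bool(rx.search(body))) for name, rx in PATTERNS.items()}


flags = add_dummies("Day 12 of quarantine and COVID news is everywhere")
```

The word boundaries (\b) keep a keyword from matching inside a longer word, and case-insensitive matching catches variants such as "Covid" and "COVID". The sum and mean of score and controversiality can then be computed over the comments where each dummy equals 1.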

Summary Statistics

This shows how much people were talking in relation to the number of Covid cases in each month.