Conclusion
Reddit is an American social news aggregation, web content rating, and discussion website. Registered members submit content such as links, text posts, images, and videos, which other members then vote up or down. Users are effectively anonymous, so they can post and comment without hesitation (except for hateful or vulgar posts, which moderators remove).
There are many subreddits on the site, ranging from news and politics to gaming and funny content. For this analysis, I selected the Public Freakout subreddit, which is dedicated to people freaking out, melting down, losing their cool, or being weird in public. Each post has a tagline describing its content, and these taglines can be extracted to understand and analyze the subreddit. People love consuming daily content, whether hilarious and silly or informative and serious. Users of this subreddit might use it to stay updated or to scroll through funny videos.
The analysis is divided into three parts: Exploratory Data Analysis (EDA), Natural Language Processing, and Machine Learning. All the business questions revolve around these three processes.
EDA is a necessary step to understand the data and explore the variables before diving into the analysis. Some basic graphs were plotted to understand the Reddit data and to look for patterns and anomalies.
Several interesting questions were explored in this section. Some of the intermediate questions could not be plotted because of the limitations of this dataset, which lacks the columns those questions require. The analysis is performed on a very high volume of data (more than 18 million rows), so it is important to understand the structure and key pieces of the data before conducting any analysis.
The relationship between the score and the length of a comment is very erratic. On average, though, short comments score lower. The pattern suggests that an author aiming for a higher score should target comments of more than 3,000 characters.
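A relationship like this can be checked by bucketing comments by character count and averaging the score in each bucket. The following is a minimal sketch, not the code used in this analysis: the `(body, score)` pair layout, the 500-character bin width, and the function name `mean_score_by_length` are all assumptions for illustration, and the sample data is made up.

```python
from collections import defaultdict

def mean_score_by_length(comments, bin_width=500):
    """Average the score of comments grouped into character-length bins.

    `comments` is an iterable of (body, score) pairs. Both the pair layout
    and the default bin width are assumptions, not taken from the dataset.
    """
    totals = defaultdict(lambda: [0, 0])  # bin start -> [score sum, count]
    for body, score in comments:
        start = (len(body) // bin_width) * bin_width
        totals[start][0] += score
        totals[start][1] += 1
    # Return bins in ascending order of comment length.
    return {start: s / n for start, (s, n) in sorted(totals.items())}

# Toy data, not drawn from the Reddit dataset.
sample = [("a" * 40, 2), ("b" * 120, 4), ("c" * 3200, 30), ("d" * 3400, 10)]
print(mean_score_by_length(sample))  # {0: 3.0, 3000: 20.0}
```

Because long comments are rare, the bins above 3,000 characters may hold very few comments, so it is worth looking at the count per bin before reading much into their averages.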
The most active users and their controversiality were plotted to gain insight into user activity in the PublicFreakout subreddit. For the most active user, the number of awards stays consistently at zero even as the score of their comments increases.
Moving on to the activity of the users, the timeframe with the highest activity (hottest comment time) is between 3pm and 7pm. Looking at the bigger picture of the timeline, there is a spike in the frequency of comments after March 2020, which coincides with the onset of the COVID-19 pandemic. This spike makes sense: people were unsure about the situation, were freaking out, and likely found Reddit, with its anonymity, the perfect platform to vent their emotions. After this jump, the number of Reddit users increases. A new dataset was joined to explore the impact of COVID-19 on the activity of the subreddit.
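The hottest comment time can be found by counting comments per hour of day. The sketch below assumes each comment carries an epoch-second timestamp (as in a `created_utc`-style field) and bins by UTC hour; the function name `comments_per_hour` and the sample timestamps are hypothetical, and the time zone used for the 3pm to 7pm window in this analysis is not implied by this code.

```python
from collections import Counter
from datetime import datetime, timezone

def comments_per_hour(timestamps):
    """Count comments by hour of day (0-23), interpreting timestamps as UTC."""
    hours = (datetime.fromtimestamp(ts, tz=timezone.utc).hour for ts in timestamps)
    return Counter(hours)

# Toy timestamps, not drawn from the Reddit dataset.
sample = [
    datetime(2020, 3, 20, 15, 5, tzinfo=timezone.utc).timestamp(),
    datetime(2020, 3, 20, 16, 30, tzinfo=timezone.utc).timestamp(),
    datetime(2020, 3, 21, 16, 45, tzinfo=timezone.utc).timestamp(),
    datetime(2020, 3, 22, 9, 10, tzinfo=timezone.utc).timestamp(),
]

counts = comments_per_hour(sample)
busiest_hour, n_comments = counts.most_common(1)[0]
print(busiest_hour, n_comments)  # 16 2
```

Since the users are spread across time zones, the choice of reference time zone shifts where the peak window appears on the clock.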
The next section of our analysis is NLP. Natural language processing (NLP) is the ability of a computer program to understand human language, referred to as natural language, as it is spoken and written.
We computed the sentiment of each comment and built our analysis around those sentiments. We looked at how sentiment fluctuates across the comments and at the relationship between sentiment and the different flags (Pandemic Freakout and Arrest Freakout).
The last section of this analysis is Machine Learning. Machine learning (ML) is a type of artificial intelligence (AI) that allows software applications to become more accurate at predicting outcomes without being explicitly programmed to do so. Machine learning algorithms use historical data as input to predict new output values.
In this section, we tried to predict the score of a comment from different features. Two models were created for this task. Comparatively few features were available, so the score could not be predicted accurately with the available comment data. The second business question that I tried to answer was predicting the controversiality of a comment from features of the Reddit comment. Two models were used to predict the label, and both have poor accuracy: accuracy in the 60s on a binary label is not far from flipping a coin. To improve the prediction, either a different model or a different combination of predictor variables could be tried.
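One way to judge whether an accuracy in the 60s is meaningful is to compare it with the accuracy of always predicting the most common label. The sketch below illustrates that comparison on made-up labels; the function names and the data are assumptions for illustration, not the models or results of this analysis.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def majority_baseline_accuracy(y_true):
    """Accuracy obtained by always predicting the most frequent label."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)

# Toy labels: 1 = controversial, 0 = not controversial (hypothetical data).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0]

print(accuracy(y_true, y_pred))            # 0.6
print(majority_baseline_accuracy(y_true))  # 0.6
```

In this toy case the model's 60% accuracy is no better than always guessing the majority class. If the controversiality label is imbalanced in the real data, the same comparison would show how much of the reported accuracy comes from the class distribution alone.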
Future Prospects
The future plan is to identify the locations where people are freaking out the most by extracting the main posts from the Reddit API. On these posts, we can use named entity recognition (NER) models to extract geopolitical entities (GPE) and perform analysis to answer this business question. Next, we can determine the category of the Reddit posts. Posts on this subreddit are divided into different sections, or flairs (e.g., Happy Freakout, Pandemic Freakout, Karen Freakout), and we can try predicting which flair a post will be assigned. After that, we can determine whether the posts are relevant to the world affairs of a specific timeframe (e.g., the Russia/Ukraine war) by pulling out the dates of the posts and matching them with the current affairs of that period. The final prospect that we can explore is to check whether negative connotations point towards a specific class of people. For this, we can provide a list of keywords to a machine learning model that categorizes texts by the class of people they refer to; once that is determined, a text analysis can establish whether posts about a specific class are negative or positive.