Reddit Analysis - NLP

Initializing Spark Session

Reading the entire dataset from S3

Text Data Checks

Cleaning is required: stopwords dominate the ten most frequent words, and they say nothing about what the comments are about.
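The check above can be sketched in plain Python as a comparison of the most frequent words before and after stopword removal. This is a minimal illustration, not the notebook's Spark code: the `STOPWORDS` set, the `tokenize` and `top_words` helpers, and the sample comments are all hypothetical.

```python
from collections import Counter
import re

# Hypothetical, deliberately tiny stopword list; a real run would use a fuller one.
STOPWORDS = {"the", "a", "an", "is", "to", "and", "of", "in", "it", "this", "that", "was"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())

def top_words(comments, n=10, stopwords=frozenset()):
    """Return the n most frequent tokens, skipping anything in stopwords."""
    counts = Counter(
        token
        for comment in comments
        for token in tokenize(comment)
        if token not in stopwords
    )
    return counts.most_common(n)

# Made-up sample comments, for illustration only.
sample_comments = [
    "This is the best reaction in the video",
    "The dog was so happy to see the owner",
    "That is a scary video and the reaction is wild",
]

raw_top = top_words(sample_comments, n=3)
clean_top = top_words(sample_comments, n=3, stopwords=STOPWORDS)
print(raw_top)
print(clean_top)
```

On the sample data the raw ranking is led by "the" and "is", while the filtered ranking surfaces content words such as "reaction" and "video".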

Some of the longest comments are not useful for analysis (for example, deleted comments). These words and sentences can be removed later.
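One way to drop such comments is a simple filter on the comment body. The sketch below assumes that unusable comments appear as the placeholder strings `[deleted]` and `[removed]`, or as empty text; the `PLACEHOLDERS` set and the `is_usable` helper are hypothetical names, and the notebook may apply a different rule.

```python
# Assumed placeholder strings for deleted or removed comments.
PLACEHOLDERS = {"[deleted]", "[removed]"}

def is_usable(body):
    """Return True when a comment body carries real text."""
    if body is None:
        return False
    stripped = body.strip()
    return bool(stripped) and stripped.lower() not in PLACEHOLDERS

# Made-up sample bodies, for illustration only.
sample_bodies = ["[deleted]", "What a reaction!", "  [removed] ", "", None, "I can't believe this"]
usable = [body for body in sample_bodies if is_usable(body)]
print(usable)
```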

Important words according to TF-IDF
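As a rough illustration of how TF-IDF ranks words, the sketch below scores tokens in plain Python. It assumes raw term counts for TF and a smoothed IDF of the form log((N + 1) / (df + 1)); the `tf_idf` helper and the sample documents are hypothetical, and the notebook's own Spark pipeline may be configured differently.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF scores per document from lists of tokens.

    Uses raw term counts for TF and a smoothed IDF, log((N + 1) / (df + 1)).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))
    scores = []
    for tokens in docs:
        tf = Counter(tokens)
        scores.append({
            term: count * math.log((n_docs + 1) / (doc_freq[term] + 1))
            for term, count in tf.items()
        })
    return scores

# Made-up tokenized comments, for illustration only.
docs = [
    ["video", "happy", "dog"],
    ["video", "scary", "crash"],
    ["video", "happy", "reunion"],
]
scores = tf_idf(docs)
print(scores[0])
```

A word that appears in every document, such as "video" here, scores zero, while a word unique to one document scores highest in that document.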

Creating Dummy Variables

Cleaning the data

Finding the Sentiment of comments

Selecting the required comments to store in S3

Graphs - Business Questions

Sentiment Count

Very few comments are neutral compared with the other two sentiments. The majority of comments are positive. This may be because authors leave motivational or happy comments under freakout videos, or because there are comparatively more videos in the happy freakout category.
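The counts behind this kind of comparison can be sketched as a tally of sentiment labels with their shares. The `sentiment_shares` helper and the label list below are hypothetical and made up for illustration; they are not the notebook's results.

```python
from collections import Counter

def sentiment_shares(labels):
    """Return {label: (count, share of all comments)} for each sentiment label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (count, count / total) for label, count in counts.items()}

# Made-up labels, for illustration only.
sample_labels = ["positive", "negative", "positive", "neutral", "positive", "negative"]
shares = sentiment_shares(sample_labels)
for label, (count, share) in sorted(shares.items()):
    print(f"{label}: {count} comments ({share:.0%})")
```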

Sentiments through time
Sentiment of authors around covid
Summary Table