Using NLP to read mean YouTube comments

Blossom Onunekwu
18 min read · Jan 15, 2021

We did it, Joe! We built our first Natural Language Processing model!

Popular Kamala Harris meme

This is a storytorial (story + tutorial) on how I built my first NLP model.

NLP is short for natural language processing. It’s a set of machine-learning techniques for analyzing text: you turn words into numbers (through a process called vectorization!) and then use a model to help you run analyses of those words.

I had learned about NLP from the Udemy course I was taking (you can read about that here), and I had a unique idea for demonstrating what I learned. The Udemy video was great, but for what I wanted to do, it wasn’t nearly enough. I wanted to analyze mean comments on my YouTube video.

In my free time, I like to create YouTube videos. It would be nice to be a big-time influencer one day, but right now I’m just enjoying making one-woman skits and watching myself grow throughout the years. I can count on one hand the number of viral videos I have.

Last year I made a video that EXPLODED, and instead of the typical comments like “you’re hilarious! 🤣🤣”, the comments section was a bit more…let’s say…disappointed.

Because of this, I refrained from reading the comments on the video, because in the beginning, most of them were very negative.

One day I accidentally clicked on a notification from the video.

I read the comment.

It actually…wasn’t as bad as I thought it would be.

That inspired me to go back and see if there were more comments from people who actually understood my point in the video.

But because many of the top comments were negative from the last time I checked, I really didn’t want to sift through aaaall that negativity. So, I let my NLP do it for me, and here’s how I did it.

This process took approximately….forever. And even though I’m really proud of myself for finishing this, I have learned that I bit off a bit more than I could chew with data science. And I need to start from the beginning.

The HOW and WHY of my NLP

We need to understand what the heck we are actually trying to accomplish.

So in reality, we want to read YouTube comments. We only turn those comments into vectors so that our machine-learning algorithm can decipher them and do what we want with them.

We want to group our comments into different topics to see if we can find underlying themes in the bloodbath — I mean, the comment section.

So that means that we need to go to YouTube. But instead of us downloading or copying and pasting each and every single YouTube comment under that video (which had over 150 comments), we need to work a little bit smarter and not so much harder.

A brief lesson on APIs.

In order to do that we’re going to use something called an API, an Application Programming Interface.

Just for giggles, let’s look up what an API is:

“An application programming interface (API) is a computing interface that defines interactions between multiple software intermediaries.” From mighty Wikipedia.

Live footage of myself after reading that definition.

I don’t like that definition. So here is my own.

An API is a middleman that gets you the information you need from a piece of software or a program without you manually exploring the entire thing (in this case, YouTube).

An API is kind of like having a mini YouTube inside of your Jupyter notebook. It’s kind of like going through the back door to steal the PS5 in the basement of your neighbor's house versus going through the front door, greeting all the family members, going downstairs, and *inconspicuously* taking the console and running with it.

So instead of me going to YouTube.com to sift through all those YouTube comments, I’m going to the back end, and with a small search query I can find that YouTube video, scrape all the comments, and put them into a spreadsheet.

Obviously, it’s not going to be that simple since it took me almost 3 months to do this project, but you get the gist.

But seriously, APIs are important. If you ever wanted to make an app that requires Google Calendar, you’re going to need a Google API. If you want to include some sort of app that already exists into something you create, then you’re going to need that app’s API.

So if you think about it, waitresses are APIs because they deliver the food you ordered. You don’t have to go up to the actual cook to ask when your food is going to be ready.

This was actually my first time using an API (can you tell?). And I don’t exactly know the ins and outs of accessing the YouTube API, so here is the tutorial I followed to help me do it for the first time.

Although, one thing I’m learning about machine learning tutorials online is that, when you try and apply what you learned to your own unique project, it’s never as simple as copy and paste. Never.

The data is never as cute and clean, and the models need more tuning.

However, for accessing the Google/YouTube API, it actually was as easy as copy and paste. So here you are.

Just don’t ask me to explain this to you, for that is not my ministry just yet.
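
If you’re curious, the skeleton of that copy-and-paste looks roughly like this sketch. It’s not my exact notebook code, and client_secret.json is a placeholder for your own OAuth credentials file from the Google Cloud Console.

# Minimal sketch of wiring up the YouTube Data API client (not my exact code).
# Assumes OAuth credentials downloaded from the Google Cloud Console as
# client_secret.json (placeholder filename).
import google_auth_oauthlib.flow
import googleapiclient.discovery

scopes = ["https://www.googleapis.com/auth/youtube.force-ssl"]

flow = google_auth_oauthlib.flow.InstalledAppFlow.from_client_secrets_file(
    "client_secret.json", scopes
)
# Older versions of google-auth-oauthlib print an authorization link here;
# newer versions replace run_console() with run_local_server().
credentials = flow.run_console()

youtube = googleapiclient.discovery.build("youtube", "v3", credentials=credentials)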

Once you run this, two links will populate, and they’ll ask you to log into your Gmail account to access the API, which I thought was cool. At first.

Once you have established the connection between your notebook on Google Colab and the API, you now have a mini YouTube at your disposal.

Next, we’ll need to use a query to search through YouTube and find the video of interest.

Next to the query is the exact name of my video. I followed along with the tutorial I shared earlier, which shows exactly how to get all the comments into one CSV.

But to provide a very high-level overview of what exactly is going on here, first I’m doing a search on the name of my video.

Once you run the code, you will get a gigantic dictionary of dictionaries within dictionaries. If you look closely, there are actually five dictionaries, representing the first five search results; by default, the API only gives you five videos.

The video of interest is the first video, indexed at 0.
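
In code, the search step looks roughly like this sketch, assuming the youtube client from above; the query string is a placeholder for my video’s actual title.

# Hedged sketch of the search step.
request = youtube.search().list(
    q="EXACT TITLE OF MY VIDEO",  # placeholder for the real title
    part="id,snippet",
    maxResults=5,                 # the API's default number of results
)
response = request.execute()

# The video of interest is the first result, indexed at 0.
video_id = response["items"][0]["id"]["videoId"]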

Then we use multiple for loops to capture the different features of the video:

video id, comments, reply count, and more.

These for loops extract the necessary data and group them in a manner that makes sense to me.

Because I have lots of comments *subtle brag*, I use a while loop to continue this process through every “page” of comments. One line of code to note here (line 30 of the notebook) is

if nextPage_token is None:
    break

(In English: if there is no next page of comments, stop the loop!)

I use the function dataframe_to_csv to convert the data frame to a CSV called “comments”.
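
Put together, the extraction step looks something like this sketch. It’s not my exact notebook code (dataframe_to_csv was a helper from the tutorial; here I just lean on pandas’ to_csv), and the column names are illustrative.

# Hedged sketch of pulling every page of top-level comments for the video.
# Assumes the `youtube` client and `video_id` from the earlier sketches.
import pandas as pd

rows = []
nextPage_token = None

while True:
    params = dict(
        part="snippet",
        videoId=video_id,
        maxResults=100,
        textFormat="plainText",
    )
    if nextPage_token:
        params["pageToken"] = nextPage_token
    response = youtube.commentThreads().list(**params).execute()

    # Pull out the fields I care about: ids, the comment text, likes, replies.
    for item in response["items"]:
        top = item["snippet"]["topLevelComment"]
        rows.append({
            "video_id": video_id,
            "comment_id": top["id"],
            "comment": top["snippet"]["textDisplay"],
            "likes": top["snippet"]["likeCount"],
            "replies": item["snippet"]["totalReplyCount"],
        })

    nextPage_token = response.get("nextPageToken")
    if nextPage_token is None:
        break  # no more pages of comments

pd.DataFrame(rows).to_csv("comments.csv", index=False)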

Cleaning the data

To clean the data, I first used a library called demoji, which strips all the emojis from the comments. My CSV at this point is called “comments,” and it was created in the prior process.

I also removed special characters. I created a new data frame called copy which was a copy of the comments.

I used a lambda function to apply my cleaning to every comment, and then I added a column called “clean comments” to reflect the change. But I ended up renaming that column to “regular comments.”
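
Roughly, the cleanup looked like this sketch, assuming pandas and the demoji library; the regex and column names are illustrative and match the earlier sketches rather than my notebook exactly.

# Hedged sketch of the emoji / special-character cleanup.
import re

import demoji
import pandas as pd

# demoji.download_codes() may be needed once on older demoji versions.
copy = pd.read_csv("comments.csv")

def clean_comment(text):
    text = demoji.replace(str(text), "")         # strip emojis
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)  # strip special characters
    return text.lower().strip()

copy["regular_comment"] = copy["comment"].apply(lambda c: clean_comment(c))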

As you can see in the data frame, there’s a lot of information that we actually don’t need. Replies, likes, and “reg,” to name a few. Although in hindsight, I could have removed Video ID and Comment ID as well.

I created a new copy of this, taking just a few features of the copy (which was originally comments.csv), and I renamed the “regular_comment” column to “comments” for simplicity.

Now we’re going to remove stop words. Stop words are words that are used constantly in everyday English and matter for grammar, but they’re not very important in determining what a sentence is trying to say. For instance, the words “and,” “or,” “for,” and “you” are defined as stop words. So we’re going to remove them.

I do a similar thing with a lambda function to remove all the stop words, and then I create another column called “no stop,” which represents the comments without stop words.
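
That looked roughly like this sketch, assuming NLTK’s English stop-word list (the notebook may have used a different list).

# Hedged sketch of stop-word removal.
import nltk
nltk.download("stopwords", quiet=True)
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))

copy["no_stop"] = copy["regular_comment"].apply(
    lambda c: " ".join(w for w in str(c).split() if w not in stop_words)
)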

We will now convert this new data frame to a CSV. I’m not too sure why we do this again to be quite honest with you, but we did. I think it would be easier to obtain the comments, clean them, and then save them as a CSV instead of having two CSVs.

You might be wondering, why do we need CSVs in the first place? Couldn’t we have just saved everything in a data frame? Keep reading, you’re in for a wild ride.

But I digress. This new CSV will be called “dataset.”
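
In code, that’s basically one line; here’s a sketch, with the column selection being illustrative.

# Hedged sketch: keep the columns I need and freeze them as dataset.csv.
dataset = copy[["comment_id", "no_stop"]].copy()
dataset.to_csv("dataset.csv", index=False)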

Calculating polarity with sentiment analysis

Sentiment analysis and calculating polarity go hand-in-hand. To do this, we’re going to use this module called TextBlob.

I’ve actually heard of sentiment analysis before learning about data science. I’m a social media manager for health businesses, and I use HootSuite to get the job done. HootSuite has a feature that detects the sentiment of your messages: most of the time it labels them positive, negative, or neutral depending on the word choice.

So, for instance, if I use a word like “death,” “sickness,” or even “health,” a lot of the time the machine-learning algorithm in HootSuite will see the message as negative.

the finished “dataset” dataframe

So here, I used TextBlob to measure polarity and added a new column to the data frame called “polarity.” Of course, since this is just a machine, there are going to be times when it’s wrong.
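
The TextBlob step is only a couple of lines, sketched here with the same column names as before.

# Hedged sketch: per-comment polarity (TextBlob returns a score from -1.0 to 1.0).
import pandas as pd
from textblob import TextBlob

dataset = pd.read_csv("dataset.csv")
dataset["polarity"] = dataset["no_stop"].apply(
    lambda c: TextBlob(str(c)).sentiment.polarity
)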

In the Google Colab notebook, you can see that I do home in on a few comments to fully read them and see if TextBlob captured the correct polarity. Some, if not many, were incorrect.

Moving forward, it would be nice to do some type of cross-validation, or check the Type I/Type II error rates, to see how accurate the sentiment assignments were.

Line 22 is marked as positive, but I think it’s more negative in context. The algorithm might have caught the word “wow,” which can be read as positive (surprised or shocked in a good way), so it makes sense why the comment was rated positive even though it probably wasn’t.

Exploratory Data Analysis

I honestly didn’t do as much EDA as I could have, because I was just really interested in seeing what the word clouds and the sentiment analysis would look like. Like I said, I was trying to figure out if the comments were as negative as I thought they would be.

Because of this, you will only see a few histograms or bar charts in this entire Jupyter notebook. And I chose a histogram because I wanted to see what the polarity looked like from a bird’s-eye view.
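
The histogram itself is a quick pandas/matplotlib job, more or less like this sketch (assuming the polarity column from the previous step).

# Hedged sketch of the polarity histogram.
import matplotlib.pyplot as plt

dataset["polarity"].plot.hist(bins=20)
plt.xlabel("polarity")
plt.ylabel("number of comments")
plt.title("Distribution of comment polarity")
plt.show()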

histogram of polarity

In this histogram, you will see a range of -0.75 to about 1 with a slight skew to the right.

There are indeed some comments that were perceived as negative, but this skew supports that many of the comments were positive, or at least neutral.

There’s a large number of comments with a polarity of 0.00, meaning the algorithm perceived many comments to be neutral or just barely positive (one comment’s polarity registered at 0.07).

Which, honestly, is nice to hear at face value. But we all know the polarity library could definitely have assigned some wrong polarities.

And now to the cherry on top: the word clouds!

Word clouds are nice but…

Side note — do word clouds count as EDA?

Now that I had the polarity, and I had taken a quick glance at the comments to back it up, I wanted to make a word cloud of all the comments I had gathered from the video. So I found a script on Stack Overflow, I believe, and built this word cloud.
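
The script I ended up with looked more or less like this sketch (not the exact Stack Overflow code).

# Hedged sketch of the all-comments word cloud.
import matplotlib.pyplot as plt
from wordcloud import WordCloud

all_text = " ".join(dataset["no_stop"].astype(str))
cloud = WordCloud(width=800, height=400, background_color="white").generate(all_text)

plt.figure(figsize=(10, 5))
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()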

my very first word cloud.

You can see the top words include “video”, “Erin”, “success” , “think”, and “feel”. Does that tell a story to you? Because it doesn’t to me.

To be honest, I created this word cloud specifically to show off. Look at me, I created a word cloud! But once I got the word cloud up and running, I realized that it didn’t really tell a story. It kind of just told us things that we already knew, or at least things that I already knew. So I turned to topic analysis.

Breaking up the comments section into topics

Topic analysis, or topic modeling, breaks all the comments down into a certain number of topics. From there, we can find themes in the comments, which can better tell us the different opinions and feelings expressed in the comments section. And honestly, the fact that the sentiment analysis gave the majority of the comments neutral or positive ratings is a little…sus, given that this video was heavily disliked. (Sus = suspicious, for those of you who have lives and don’t play Among Us.)

So I converted the words into vectors and built an LDA model. LDA stands for Latent Dirichlet allocation, and I have never said this word out loud, so please don’t ask me to pronounce it for you.

And it is used to learn how a body of text breaks down into a fixed number of topics.

Normally it’s used for documents. Let’s say you have a multitude of articles from Wikipedia and you want to group these articles based on their contents.

An LDA model will assign a topic to each of these articles and even give you a probability distribution of how likely it is that each article belongs to each topic.

In my instance, we don’t have articles, but we do have comments. And I just wanted to see the top topics of the corpus (AKA body of words) of comments.
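
With scikit-learn, that step looks roughly like the sketch below; it assumes CountVectorizer and LatentDirichletAllocation, wired up for the five topics and ten words per topic I describe next.

# Hedged sketch: vectorize the comments and fit an LDA topic model.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(stop_words="english")
doc_term = vectorizer.fit_transform(dataset["no_stop"].astype(str))

lda = LatentDirichletAllocation(n_components=5, random_state=42)
lda.fit(doc_term)

# Print the top 10 words for each of the 5 topics.
words = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top_words = [words[idx] for idx in topic.argsort()[-10:][::-1]]
    print(f"Topic #{i}:")
    print(" ".join(top_words))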

I told my LDA to give me five topics with 10 words to describe each topic. And it birthed this.

Topics found via LDA:

Topic #0:

erin love don ve just way hard channel feel good

Topic #1:

like just feel video people understand totally good ve think

Topic #2:

video feel erin like success need don just think people

Topic #3:

video just erin don love know like lol look different

Topic #4:

erin video just people love like did great business title

And once I got the topics I realized something.

Yes, this is actual code in my Jupyter notebook. At least for now.

I was expecting cohesive topic sentences to help me learn about the main ideas captured in the comments section.

Something like

Topic 1: “This is clickbait please stop.”

Topic 2: “ Hater take this down”

Topic 3: “Delete this”

(Although, since I removed stop words, they probably wouldn’t have been this cohesive either.)

The words “Erin” and “video” appeared in almost all the topics, and it was really hard to discern which topic was which. So I made a new count vectorizer and LDA, and this time I made sure to remove words that appear in 90% or more of the comments.

Everybody knows this video is about Erin (it’s in the name of the video), so there really isn’t any need for that word to be part of these topics. This is telling us things that we already know.

Instead of getting topics in the form of 10 words from the LDA, this time around, I just programmed the model to share the top 50 words in each topic. And this time, I broke it down to 3 topics.
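
Round two looked roughly like this sketch. I’m reading “remove words that appear in 90% or more of the comments” as max_df=0.90, and the rest mirrors the first pass.

# Hedged sketch of the second vectorizer/LDA: drop near-universal words,
# fit 3 topics, and print the top 50 words per topic.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

vectorizer2 = CountVectorizer(stop_words="english", max_df=0.90)
doc_term2 = vectorizer2.fit_transform(dataset["no_stop"].astype(str))

lda2 = LatentDirichletAllocation(n_components=3, random_state=42)
lda2.fit(doc_term2)

words2 = vectorizer2.get_feature_names_out()
for i, topic in enumerate(lda2.components_):
    top_words = [words2[idx] for idx in topic.argsort()[-50:][::-1]]
    print(f"Topic #{i}: {' '.join(top_words)}")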

top 50 words in each topic

I like to test things out as I create them. So I printed out the first comment to see its probability distribution. From here, you can see three decimals; these represent the probability of that comment belonging to each of the three topics. The model decided that the first comment best belonged to topic 2, since that probability was the largest.

Next, I added a new column to the data frame and titled it “topic,” which holds the topic assigned to each comment. This data frame just makes things look a bit neater, so you can see which comment was assigned to which topic.
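
Those two steps are just a few lines on top of the fitted model; here’s a sketch.

# Hedged sketch: per-comment topic probabilities and the assigned "topic" column.
topic_probs = lda2.transform(doc_term2)        # one row of 3 probabilities per comment

print(topic_probs[0])                          # the first comment's distribution
dataset["topic"] = topic_probs.argmax(axis=1)  # keep the most likely topic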

Because I didn’t really like the way LDA created the topics, and I didn’t spend enough time tuning this new model, I decided to create the actual topics myself using, you guessed it, more word clouds.

Get ready for the MOST time consuming part of this NLP project.

So for each topic, I created an individual word cloud. From there I would double-check to see the context of the most popular words, and create my own topic sentence for them based on that information.

I had a lot of trouble with this part because the original Stack Overflow code that I used to create the first word cloud was not going to work for three different word clouds.

I found code that cycled through each topic’s word cloud using a for loop, and I thought that was what was going to help me.

But after chatting with my mentor, we realized that the word clouds weren’t really accurate. Normally in a word cloud, the largest words are the ones that appear most frequently. But that wasn’t the case here. And I realized that way too late.

So after scouring the internet for a word cloud code that worked correctly and failing, I took matters into my own hands.

Word clouds work off of a corpus, a body of words.

We already have one big body of words — the comments section. But since I had three different topics, I would need to create three different bodies of words.

In order to do that I went through each topic, and combined all the comments into one string. From there I fed that string into the word cloud.

Is this hard coding? Yes. Did it work though? Also yes.

There are probably more sophisticated ways to create word clouds for different topics within a giant corpus, but this is the way that worked accurately for me. I even added a function so that users could search for a word they see in the word cloud and get back its frequency. So if “dinosaur” is the biggest word there, you could use the function and it’ll tell you how many times the word appeared in the corpus.
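
Here’s roughly what that looked like as a sketch: one combined corpus string per topic, one word cloud per topic, and a small frequency helper (the drop_words set and the word_frequency name are mine, for illustration).

# Hedged sketch of the "hard-coded" per-topic word clouds plus a frequency check.
from collections import Counter

import matplotlib.pyplot as plt
from wordcloud import WordCloud

drop_words = {"erin", "video", "feel"}  # overly common words I ended up removing

for t in sorted(dataset["topic"].unique()):
    # Combine every comment assigned to this topic into one corpus string.
    topic_text = " ".join(dataset.loc[dataset["topic"] == t, "no_stop"].astype(str))
    topic_text = " ".join(w for w in topic_text.split() if w not in drop_words)

    cloud = WordCloud(width=800, height=400, background_color="white").generate(topic_text)
    plt.figure(figsize=(10, 5))
    plt.imshow(cloud, interpolation="bilinear")
    plt.axis("off")
    plt.title(f"Topic {t}")
    plt.show()

def word_frequency(word, topic):
    # How many times does `word` appear in a given topic's comments?
    text = " ".join(dataset.loc[dataset["topic"] == topic, "no_stop"].astype(str))
    return Counter(text.split())[word.lower()]

print(word_frequency("dinosaur", 0))  # hypothetical example word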

In my first take of the word cloud, I would get giant words, but with a little digging I saw that they only appeared three or five times in the corpus.

My “hard-coded” technique is a cheap and easy way to address accuracy when it comes to what is being represented in the word clouds.

I also removed the most popular words we found in the first word cloud (Erin, video, feel, etc). Once again, I wanted to really discover things I didn’t already know.

I ran the script and out came three beautiful word clouds for the three topics.

And then the next day, all of them were gone.

All that time I spent in learning to hard code and creating a search function to check validity — wasted.

To this day, I still don’t know how my Google Colab document, which was saved to my Google Drive, a CLOUD service, did not automatically save my work. Crazy, right?

So I did it again maybe a week later, once I recovered from the heartbreaking disappointment.

And out came my three beautiful word clouds again.

And no conclusion.

The whole point of this project was to figure out if the comments section was as bad as I thought it was using sentiment analysis and topic modeling.

But once I created the word clouds, read all those unhappy comments to check for context, and created the search function to check the word clouds’ validity, I didn’t even bother to analyze what story the clouds were telling me.

After fighting the internet to figure out why, oh why, my word cloud code was not working correctly, figuring out how to hard code the three word clouds, losing that script, and then having to remember how to do it all over again, I was pooped. I had to create word clouds at least five times for this one project.

If I get asked to make a word cloud one more time this year, I think I’m going to hurl.

While I didn’t dive in to answer my own research question, I still learned a lot from this project and its many issues.

Issues with my NLP project

For one, there is no conclusion. The whole point of this project was to determine whether or not the comment section was as lethal as I anticipated.

Ideally, to do that, I would have to read each of these word clouds and create a cohesive topic sentence capturing the theme of each one. But I didn’t really take the time to do that because, like I said, I was a bit frustrated. Actually, frustrated is an understatement.

It wasn’t just frustration; I also ran into a lot of confusion. Which leads me to my next issue.

To me, the whole point of a word cloud is to be able to portray the story that’s being shared within it.

What happens when the story keeps changing?

Every time I ran the notebook, the word clouds would change words. And I didn’t realize it at first until I sat down and tried to create my own topic sentences for each of the word clouds based on the words that I saw.

The next day when I ran the notebook again, I noticed that the word clouds didn’t really match up with the description or the topic sentences that I had created for them.

It wasn’t until very late that I realized that every time I ran the Colab notebook, the API would be accessed, the comments would be updated (because I still get comments on that video), and therefore the word clouds would change.

So in the end, I would have a totally new word cloud from the last time I ran the notebook. And when I realized this, I was infuriated.

The only reason I wanted to keep the API code in the notebook is so that I can show you all that I accessed an API for the very first time.

But I should have accessed it once and then used only the CSV that was created once I got all the information I needed. This is why we needed a CSV.
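
In other words, something like this sketch: hit the API once, freeze the results in a CSV, and work from the file afterward (fetch_comments_from_api is a hypothetical stand-in for the extraction code above).

# Hedged sketch of the fix: only call the API when there's no saved snapshot.
import os

import pandas as pd

if os.path.exists("comments.csv"):
    comments = pd.read_csv("comments.csv")        # reuse the frozen snapshot
else:
    comments = fetch_comments_from_api(video_id)  # hypothetical one-time API pull
    comments.to_csv("comments.csv", index=False)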

I’ll probably go back in and assign topics to the word clouds. And I’ll probably clean up my code as well because it is messy, I’m not even going to lie. But I didn’t want to spend another minute on this incredibly arduous project before sharing to the world that I had attempted it.

I received a lot of help making this. A lot more than I wanted.

While I learned a lot about natural language processing, sentiment analysis, and word clouds, a big takeaway here is that I need to work on the fundamental skills of data science.

That way, I can better understand what it is that I’m doing and why. I definitely need to start programming and coding a little bit more.

And it would also be nice to learn the logistics behind these algorithms, especially the polarity one. I wonder if there’s any way to measure its validity or to fine-tune it. I definitely need to take a few steps back from the machine learning algorithms and see if I can get my hands wet in regular programming and even regular statistics. I think I need to understand what these machines are mathematically doing.

Data science isn’t just machine learning algorithms, contrary to my own belief.

So please everyone, feel free to leave in the comments questions that you have and also courses or videos you think I should watch to better improve my understanding of the fundamentals of data science!

Here is the link to the full repository for this project. Thanks so much for making it this far!

Oh yeah, and if you haven’t already figured it out, this is the viral video I made by which this project was inspired. It’s a bit cringy, and wasn’t at all well received. But I’m thankful for the “mean” comments/tough love. Without it, you wouldn’t be reading this storytorial!

Don’t forget to follow me on Twitter or subscribe to my YouTube channel where I’ll be talking more about my life as a Health communication specialist who enjoys exploring data science.

My next video will be the video version of this blog post. And there, I’ll actually answer my research question! See you there.

