2. Create Reddit API Account
3. Scraping Reddit post
4. Scraping Reddit subreddits
5. Cleaning the data
With the sharp rise of data, it keeps getting more worthwhile to scrape, gather, collect, amass (and many other equally meaningful words) all sorts of information from sources such as Facebook, Twitter, and Reddit. With that in mind, Reddit's API has long had a Python wrapper called the Python Reddit API Wrapper, PRAW for short, which uses Python (as already in the name, I know!) to crawl data. …
2. Using TextBlob
3. Using local classifier
A predicament I have run into recently is the lack of a suitable dataset or corpus to train a model on. Sometimes, even when there is sufficient data, it is of little benefit if it is not labeled. Usually, the go-to solution for something like this is unsupervised learning, which clusters or groups the data into classes and thus creates the dependent column one wishes for. …
2. Create Twitter API Developer Account
3. Scrape Hashtag
4. Scrape Tweet Conversation
For the most part, when anyone thinks of data and scraping it, Twitter mostly, if not always, comes to mind. If you’re reading this article, I’m more than sure you don’t need me to tell you how Twitter is a huge treasure trove when it comes to raw data. Nor will I bombard you with statistics (even though they are super exciting, they are not today’s subject) or quotes from industry leaders on just that. …
K-nearest Neighbor (KNN) is a supervised learning algorithm that predicts the label of a new data point from the training points most similar to it. KNN is most widely used for classification problems, but it can also be used to solve regression problems. The underlying assumption is that similar data points exist in clusters, or in close proximity to one another. KNN is a non-parametric algorithm, which means it makes no assumption about the distribution of the data; for example, it does not care whether the data is normally distributed or not.
So how is KNN implemented? Let’s go over it now:
1. Initialize a…
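The excerpt above cuts off before the steps are listed, so here is only a minimal sketch of the general idea it describes: measure how close a new point is to the labeled points, keep the `k` closest, and let them vote. The function name `knn_predict`, the choice of Euclidean distance, and the toy data are my own assumptions for illustration, not the article's implementation.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Distance from the query to every training point (Euclidean is an assumption).
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbours.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the neighbours' labels.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Two made-up clusters, purely for illustration.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, (1.1, 1.0)))  # lands in the "a" cluster
print(knn_predict(points, labels, (5.1, 5.0)))  # lands in the "b" cluster
```

Note that there is no training step: the "model" is simply the stored data, which is what makes KNN non-parametric.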
The Naive Bayes algorithm is another very popular supervised machine learning algorithm, one that uses Bayes' Theorem to classify data into predefined classes. A classic example of Naive Bayes at work is text classification, such as telling spam apart from normal emails. Naive Bayes is also called a probabilistic classifier, as it calculates the probability of a given data point belonging to each class.
The term Naive Bayes itself is an interesting one, and one you ought to know. The term Naive itself refers to the fact that the algorithm assumes that all the features are independent of…
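To make the spam example concrete, here is a minimal sketch of a word-count Naive Bayes classifier. Everything in it is an assumption for illustration: the function names, the four made-up messages, whitespace tokenization, and add-one smoothing. The "naive" part shows up in the inner loop, where each word's probability is multiplied in (added, in log space) as if the words were independent of one another.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(documents, labels):
    """Count documents per class and words per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in zip(documents, labels):
        words = text.lower().split()
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary


def predict_naive_bayes(model, text):
    """Return the class with the highest log-probability for `text`."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, doc_count in class_counts.items():
        # Log prior: how common the class is overall.
        score = math.log(doc_count / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        scores[label] = score
    return max(scores, key=scores.get)


# Made-up training messages, purely for illustration.
docs = [
    "win free money now",
    "free prize claim now",
    "meeting at noon tomorrow",
    "lunch with the team tomorrow",
]
tags = ["spam", "spam", "ham", "ham"]

model = train_naive_bayes(docs, tags)
print(predict_naive_bayes(model, "free money"))
print(predict_naive_bayes(model, "team meeting tomorrow"))
```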
When diving into supervised machine learning for the very first time, one usually meets logistic regression quite early on, probably right after learning about linear regression. And for good reason: logistic regression, whether binary or multinomial, is very similar to linear regression. Logistic regression is used to predict the probability of a given observation belonging to one of two categories (binary) or one of several categories (multinomial). So the biggest difference between linear and logistic regression is this: linear regression assumes linearity, normality, and homoscedasticity in the data. To put it in simpler terms, the data…
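As a rough illustration of "predicting a probability" in the binary case, the sketch below fits a one-feature logistic regression with plain gradient descent. The names `sigmoid` and `fit_logistic`, the learning rate, the number of epochs, and the made-up hours-studied data are all assumptions of mine rather than anything taken from the article.

```python
import math


def sigmoid(z):
    """Squash any real number into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, learning_rate=0.1, epochs=2000):
    """Fit weight and bias by gradient descent on the log loss."""
    weight, bias = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            # Predicted probability minus the true label (0 or 1).
            error = sigmoid(weight * x + bias) - y
            grad_w += error * x
            grad_b += error
        weight -= learning_rate * grad_w / n
        bias -= learning_rate * grad_b / n
    return weight, bias


# Made-up data: hours studied versus passed (1) or failed (0).
hours = [1, 2, 3, 4, 5, 6, 7, 8]
passed = [0, 0, 0, 0, 1, 1, 1, 1]

weight, bias = fit_logistic(hours, passed)
for h in (1, 8):
    print(h, round(sigmoid(weight * h + bias), 3))
```

The linear part, `weight * x + bias`, is exactly what linear regression computes; the only addition is the sigmoid that turns that number into a probability between 0 and 1.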
Linear Regression is probably the first algorithm you will encounter when starting out on your Data Science learning journey, and guess what? So did I, and most probably everyone else. It’s one of the least complex algorithms to understand and apply, if not the least complex of all.
Now let’s start with the math. I know this is not the fun part of Data Science, but it is as important to grasp as anything else.
Let’s break down this equation into bite-sized chunks:
4. Dynamic undersampling and oversampling
One would like to think that all the datasets, corpora, and so forth out there are going to be in pristine and perfect condition: no null values, balanced classes, large amounts of data, every data scientist's version of the perfect dataset. Alas, this is more or less a rarity, almost what Egyptians would call a “4th impossible”, one could say. Undersampling and oversampling are techniques used to combat the issue of unbalanced classes in a dataset. We sometimes do this…
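For a feel of what these two techniques do, here is a minimal sketch of plain random undersampling and random oversampling. This is the simplest variant, not the dynamic approach the article goes on to cover, and the function names, the fixed seed, and the toy data are assumptions for illustration only.

```python
import random
from collections import Counter, defaultdict


def _group_by_label(rows, labels):
    """Collect rows into one list per class label."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[label].append(row)
    return groups


def random_undersample(rows, labels, seed=0):
    """Randomly drop rows until every class matches the smallest class."""
    rng = random.Random(seed)
    groups = _group_by_label(rows, labels)
    target = min(len(group) for group in groups.values())
    out_rows, out_labels = [], []
    for label, group in groups.items():
        for row in rng.sample(group, target):  # sample without replacement
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


def random_oversample(rows, labels, seed=0):
    """Randomly duplicate rows until every class matches the largest class."""
    rng = random.Random(seed)
    groups = _group_by_label(rows, labels)
    target = max(len(group) for group in groups.values())
    out_rows, out_labels = [], []
    for label, group in groups.items():
        # Draw the missing rows with replacement from the same class.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for row in group + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Made-up unbalanced data: 8 rows of class "neg", 2 rows of class "pos".
rows = list(range(10))
labels = ["neg"] * 8 + ["pos"] * 2

_, under_labels = random_undersample(rows, labels)
_, over_labels = random_oversample(rows, labels)
print(Counter(under_labels))
print(Counter(over_labels))
```

Undersampling throws information away and oversampling repeats the same minority rows, so which one hurts less depends on how much data there is to begin with.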
Data Engineer @ IBM, and hopefully a future full-stack Data Scientist. This blog is part of my learning journey, maybe it can be part of yours too.