scikit-learn pipelines have been enormously helpful to me over the course of building a new sentiment analysis engine for Earshot, so it’s time to spread the good news. Read on for more information on what we’ll cover in this series, what requirements you’ll need, and more badly executed attempts to use the rule of three.
This is Part 1 of 5 in a series on building a sentiment analysis pipeline using scikit-learn. You can find Part 2 here.
Jump to:
- Part 2 - Building a basic pipeline
- Part 3 - Adding a custom function to a pipeline
- Part 4 - Adding a custom feature to a pipeline with FeatureUnion
- Part 5 - Hyperparameter tuning in pipelines with GridSearchCV
Part 1 - Introduction and requirements
As a data scientist at Earshot, I’m always looking for ways to cut through the noise of social media and get to posts that are interesting for our clients. One such signal that we use to determine relevance is the sentiment of a post - is this user expressing positive or negative feelings? If an influential person is talking unfavorably about you on social media, you want to get out ahead of it quickly.
Previously, we used various off-the-shelf packages and APIs to get sentiment for our posts, and we found their results unsatisfying - these third-party sentiment solutions simply did not perform well on social media text. When we dug in to figure out why, we found that some were trained on longer texts (Yelp reviews, IMDB reviews, Amazon reviews, and the like), some did not handle emojis at all, and some relied on predefined dictionaries of adjectives that couldn’t keep up with slang and the way English changes. Because of this, we thought “maybe we can do better”.
We could, and we did.
By creating a sentiment analysis engine that is attuned to the “unique” way users express themselves on social media, we’ve created a solution that works better for our customers.
How did we do it? With blood, sweat, and tears – and a little TLC. More specifically, we used the magic of scikit-learn pipelines to help rapidly build, iterate, and productionize our model. They’re unimpeachably awesome, and if you’re using Python for machine learning and not using these things, your life is about to get much easier. Come with me as I show you how to build a pipeline and add all sorts of fun* steps to it!
*Fun not guaranteed
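To give you a taste of why pipelines make life easier, here’s a minimal sketch of one. The vectorizer and classifier here are placeholders chosen for illustration, not necessarily the estimators we end up using in later parts:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# A minimal two-step pipeline: turn raw text into token counts,
# then feed those counts to a classifier. The specific estimators
# are placeholders - later parts build this out properly.
sentiment_pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', LogisticRegression()),
])

# The whole pipeline trains and predicts like a single estimator.
sentiment_pipeline.fit(
    ["I love this", "I hate this"],   # toy training texts
    [1, 0],                           # toy labels: 1 = positive, 0 = negative
)
print(sentiment_pipeline.predict(["this is great"]))
```

That’s the core appeal: raw text goes in one end, predictions come out the other, and every intermediate step lives in one object you can swap pieces in and out of.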
What are we going to cover?
- Basic scikit-learn pipeline building
- Adding custom functions as pipeline steps for text preprocessing
- Adding custom functions as additional features
- Using GridSearchCV to search for optimal parameters for each step (see the quick sketch below)
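Since that last item is the big payoff of the whole approach, here’s a tiny, self-contained taste of how pipeline steps and their parameters get addressed in a grid search. The step names, parameters, and values are illustrative assumptions, not the grid we’ll actually search later:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', LogisticRegression()),
])

# Parameters of individual steps are addressed as '<step name>__<parameter>'.
# These parameters and values are examples only.
param_grid = {
    'vectorizer__ngram_range': [(1, 1), (1, 2)],
    'classifier__C': [0.1, 1.0, 10.0],
}

grid = GridSearchCV(pipeline, param_grid, cv=3)
# grid.fit(texts, labels) would then try every combination via cross-validation.
```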
What are we not going to cover?
- Machine learning basics
- Machine learning with text basics
- What is sentiment analysis?
- How to scrape Twitter data (though the code is available in the repo)
- Where babies come from
Requirements
First things first - we have to make sure you have everything you need to do this thing. You’ll need the following packages for maximum enjoyment:
- pandas
- NumPy
- scikit-learn (>= 0.18)
- Twython (for getting Twitter data)
- NLTK
- pandas_confusion (for neato confusion matrices)
- Jupyter (if you want to run the notebooks yourself)
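If you want a quick sanity check that your environment is ready, a few imports and version prints will do it. This is just a convenience sketch, not code from the repo:

```python
# Rough environment check: confirm the core packages import and that
# scikit-learn is at least 0.18. This is a convenience sketch, not
# part of the series' code.
import nltk
import numpy
import pandas
import sklearn

for name, module in [("pandas", pandas), ("numpy", numpy),
                     ("scikit-learn", sklearn), ("nltk", nltk)]:
    print(name, module.__version__)

major, minor = (int(part) for part in sklearn.__version__.split(".")[:2])
assert (major, minor) >= (0, 18), "scikit-learn 0.18 or newer is required"
```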
Code
There are several helper functions that we will refer to in sklearn_helpers.py and fetch_twitter_data.py. If you want to follow along at home, clone the repo from GitHub and run the Jupyter notebooks found therein. Note - you will need to modify fetch_twitter_data.py with your Twitter credentials in order to download the data.
Ready?
This is Part 1 of 5 in a series on building a sentiment analysis pipeline using scikit-learn. You can find Part 2 here.