scikit-learn pipelines have been enormously helpful to me over the course of building a new sentiment analysis engine for Earshot, so it’s time to spread the good news. Read on for more information on what we’ll cover in this series, what requirements you’ll need, and more badly executed attempts to use the rule of three.
This is Part 1 of 5 in a series on building a sentiment analysis pipeline using scikit-learn. You can find Part 2 here.
Jump to:
- Part 2 - Building a basic pipeline
- Part 3 - Adding a custom function to a pipeline
- Part 4 - Adding a custom feature to a pipeline with FeatureUnion
- Part 5 - Hyperparameter tuning in pipelines with GridSearchCV
Part 1 - Introduction and requirements
As a data scientist at Earshot, I’m always looking for ways to cut through the noise of social media and get to posts that are interesting for our clients. One such signal that we use to determine relevance is the sentiment of a post - is this user expressing positive or negative feelings? If an influential person is talking unfavorably about you on social media, you want to get out ahead of it quickly.
Previously, we used various off-the-shelf packages and APIs to get sentiment for our posts, and we found their results unsatisfying - these third-party sentiment solutions simply did not perform well on social media text. When we dug in to figure out why, we found that some were trained on longer texts (Yelp reviews, IMDB reviews, Amazon reviews, and the like), some did not handle emojis at all, and some relied on predefined dictionaries of adjectives that couldn’t keep up with slang and the way English changes. Because of this, we thought “maybe we can do better”.
We could, and we did.
By creating a sentiment analysis engine that is attuned to the “unique” way users express themselves on social media, we’ve created a solution that works better for our customers.
How did we do it? With blood, sweat, and tears – and a little TLC. More specifically, we used the magic of scikit-learn pipelines to help rapidly build, iterate, and productionize our model. They’re unimpeachably awesome, and if you’re using Python for machine learning and not using these things, your life is about to get much easier. Come with me as I show you how to build a pipeline and add all sorts of fun* steps to it!
*Fun not guaranteed
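To give you a taste of why pipelines make life easier, here’s a minimal sketch of one. The vectorizer and classifier here are placeholders chosen for illustration, not necessarily the estimators we end up using in later parts:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# A minimal two-step pipeline: turn raw text into token counts,
# then feed those counts to a classifier. The specific estimators
# are placeholders - later parts build this out properly.
sentiment_pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', LogisticRegression()),
])

# The whole pipeline trains and predicts like a single estimator.
sentiment_pipeline.fit(
    ["I love this", "I hate this"],   # toy training texts
    [1, 0],                           # toy labels: 1 = positive, 0 = negative
)
print(sentiment_pipeline.predict(["this is great"]))
```

That’s the core appeal: raw text goes in one end, predictions come out the other, and every intermediate step lives in one object you can swap pieces in and out of.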
What are we going to cover?
- Basic scikit-learn pipeline building
- Adding custom functions as pipeline steps for text preprocessing
- Adding custom functions as additional features
- Using GridSearchCV to search for optimal parameters for each step (see the quick sketch below)
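Since that last item is the big payoff of the whole approach, here’s a tiny, self-contained taste of how pipeline steps and their parameters get addressed in a grid search. The step names, parameters, and values are illustrative assumptions, not the grid we’ll actually search later:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', LogisticRegression()),
])

# Parameters of individual steps are addressed as '<step name>__<parameter>'.
# These parameters and values are examples only.
param_grid = {
    'vectorizer__ngram_range': [(1, 1), (1, 2)],
    'classifier__C': [0.1, 1.0, 10.0],
}

grid = GridSearchCV(pipeline, param_grid, cv=3)
# grid.fit(texts, labels) would then try every combination via cross-validation.
```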
What are we not going to cover?
- Machine learning basics
- Machine learning with text basics
- What is sentiment analysis?
- How to scrape Twitter data (though the code is available in the repo)
- Where babies come from
Requirements
First things first - we have to make sure you have everything you need to do this thing. You’ll need the following packages for maximum enjoyment:
- pandas
- NumPy
- scikit-learn (>= 0.18)
- Twython (for getting Twitter data)
- NLTK
- pandas_confusion (for neato confusion matrices)
- Jupyter (if you want to run the notebooks yourself)
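If you want a quick sanity check that your environment is ready, a few imports and version prints will do it. This is just a convenience sketch, not code from the repo:

```python
# Rough environment check: confirm the core packages import and that
# scikit-learn is at least 0.18. This is a convenience sketch, not
# part of the series' code.
import nltk
import numpy
import pandas
import sklearn

for name, module in [("pandas", pandas), ("numpy", numpy),
                     ("scikit-learn", sklearn), ("nltk", nltk)]:
    print(name, module.__version__)

major, minor = (int(part) for part in sklearn.__version__.split(".")[:2])
assert (major, minor) >= (0, 18), "scikit-learn 0.18 or newer is required"
```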
Code
There are several helper functions that we will refer to in sklearn_helpers.py and fetch_twitter_data.py. If you want to follow along at home, clone the repo from GitHub and run the Jupyter notebooks found therein. Note - you will need to modify fetch_twitter_data.py with your Twitter credentials in order to download the data.
Ready?
This is Part 1 of 5 in a series on building a sentiment analysis pipeline using scikit-learn. You can find Part 2 here.