We have all these delicious preprocessing steps, feature extraction, and a neato classifier in our pipeline. Now it’s time to tune this pipeline.
This is Part 5 of 5 in a series on building a sentiment analysis pipeline using scikit-learn. You can find the introduction here.
Jump to:
- Part 1 - Introduction and requirements
- Part 2 - Building a basic pipeline
- Part 3 - Adding a custom function to a pipeline
- Part 4 - Adding a custom feature to a pipeline with FeatureUnion
- Part 5 - Hyperparameter tuning in pipelines with GridSearchCV
We’ve come so far together. Give yourself a hand.
Alright, ready to finish things up? Let’s move on to the final entry in this series. We’re going to do a parameter search to try to make this pipeline the best it can be.
Setup
In [1]:
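(The code for this cell didn't survive the trip to the blog, so here's a rough sketch of what it does: grab our labeled posts page by page, then split them into training and test sets. `fetch_page` is a hypothetical stand-in for the loader we built back in Part 1 - swap in your own data source.)

```python
%%time
import pandas as pd
from sklearn.model_selection import train_test_split

# fetch_page is hypothetical - it stands in for the Part 1 loader and should
# return a list of dicts like {'text': ..., 'sentiment': ...} for a given page.
def get_all_posts(n_pages=10):
    posts = []
    for page in range(1, n_pages + 1):
        page_posts = fetch_page(page)
        print('got {} posts from page {}...'.format(len(page_posts), page))
        posts.extend(page_posts)
    print('got all pages - {} posts in total'.format(len(posts)))
    return pd.DataFrame(posts)

df = get_all_posts()

# The default test_size of 0.25 gives the 225-post test set seen below.
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['sentiment'])
```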
got 92 posts from page 1...
got 88 posts from page 2...
got 88 posts from page 3...
got 91 posts from page 4...
got 87 posts from page 5...
got 89 posts from page 6...
got 95 posts from page 7...
got 93 posts from page 8...
got 86 posts from page 9...
got 90 posts from page 10...
got all pages - 899 posts in total
CPU times: user 853 ms, sys: 348 ms, total: 1.2 s
Wall time: 7.7 s
Construct the pipeline
This is the same pipeline that we ended up with in part 4.
In [2]:
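(This cell's code is also missing from the export, so here's a reconstruction of the Part 4 pipeline, pieced together from the step names that show up in the parameter grid below. The `genericize_mentions` and `post_length` helpers are simplified stand-ins for the versions built in Parts 3 and 4, not the originals.)

```python
import re
import numpy as np
from nltk.tokenize.casual import casual_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

# Simplified stand-ins for the helpers from Parts 3 and 4. The 'active' flag
# lets the grid search switch each step on and off.
def genericize_mentions(texts, active=True):
    # Replace @-mentions with a generic token so the model doesn't
    # memorize specific usernames.
    if not active:
        return list(texts)
    return [re.sub(r'@\w+', '_MENTION_', text) for text in texts]

def post_length(texts, active=True):
    # One numeric feature column: the character length of each post.
    if not active:
        return np.zeros((len(texts), 1))
    return np.array([len(text) for text in texts]).reshape(-1, 1)

pipeline = Pipeline([
    ('genericize_mentions', FunctionTransformer(
        genericize_mentions, validate=False, kw_args={'active': True})),
    ('features', FeatureUnion([
        ('vectorizer', CountVectorizer(tokenizer=casual_tokenize)),
        ('post_length', FunctionTransformer(
            post_length, validate=False, kw_args={'active': True})),
    ])),
    ('classifier', LogisticRegression()),
])
```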
Searching for golden hyperparameters
One really sweet thing that scikit-learn has is a nice built-in parameter search class called GridSearchCV. It plays nicely with pipelines too. First we'll construct our parameter grid and instantiate our GridSearchCV.
In [3]:
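(The grid itself is another casualty of the export. Judging by the winning parameters printed further down, it toggled the two FunctionTransformer steps on and off via kw_args and swept the vectorizer and classifier settings. The values here are a plausible reconstruction - they happen to come out to 360 candidates, matching the log below, but the exact grid the original swept isn't recoverable.)

```python
import numpy as np
from sklearn.model_selection import GridSearchCV

# An assumed grid - only the winning values printed below are known for sure.
# Double-underscores address parameters of nested pipeline steps.
param_grid = {
    'genericize_mentions__kw_args': [{'active': True}, {'active': False}],
    'features__post_length__kw_args': [{'active': True}, {'active': False}],
    'features__vectorizer__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'features__vectorizer__max_df': [0.25, 0.5, 1.0],
    'classifier__C': np.logspace(-1, 2, 10),
}

# 2 * 2 * 3 * 3 * 10 = 360 candidates, each cross-validated over 3 folds.
grid_search = GridSearchCV(pipeline, param_grid, cv=3, verbose=1)
```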
Now we’re ready to perform that grid search. It’s going to take a while (~3 minutes on my laptop) so kick back and relax. Or pace around and be tense. I’m not going to police the way you spend your downtime.
In [4]:
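(Roughly what this cell does - the original used a fancier reporting helper from earlier in the series, so this sketch leans on plain sklearn.metrics instead.)

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

grid_search.fit(X_train, y_train)
predictions = grid_search.predict(X_test)

# Null accuracy: what you'd score by always guessing the most common class.
null_accuracy = y_test.value_counts(normalize=True).max()
accuracy = accuracy_score(y_test, predictions)
print('null accuracy: {:.2%}'.format(null_accuracy))
print('accuracy score: {:.2%}'.format(accuracy))
print('model is {:.2%} more accurate than null accuracy'.format(accuracy - null_accuracy))
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))
```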
Fitting 3 folds for each of 360 candidates, totalling 1080 fits
null accuracy: 40.44%
accuracy score: 64.00%
model is 23.56% more accurate than null accuracy
---------------------------------------------------------------------------
Confusion Matrix
Predicted   negative  neutral  positive  __all__
Actual
negative          29       15        20       64
neutral            8       43        19       70
positive           7       12        72       91
__all__           44       70       111      225
---------------------------------------------------------------------------
Classification Report
                 precision    recall  F1_score  support
Classes
negative          0.659091  0.453125  0.537037       64
neutral           0.614286  0.614286  0.614286       70
positive          0.648649  0.791209  0.712871       91
__avg / total__   0.640928  0.640000  0.632185      225
[Parallel(n_jobs=1)]: Done 1080 out of 1080 | elapsed: 2.9min finished
And now we’ll print out what hyperparameters the search found made the best model:
In [5]:
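(A sketch of the reporting cell: pull the winning settings off the fitted search and pretty-print them. The original also dug the tokenizer's settings out of the best estimator, which depends on helpers not shown here.)

```python
import json

# The fitted search exposes the winning combination as best_params_.
print('best parameters:', json.dumps(grid_search.best_params_, indent=4, default=str))
```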
used CasualTokenizer with settings:
preserve case: False
reduce length: False
best parameters: {
    "features__vectorizer__ngram_range": [
        1,
        1
    ],
    "features__vectorizer__max_df": 0.25,
    "classifier__C": 0.10000000000000001,
    "features__post_length__kw_args": {
        "active": true
    },
    "genericize_mentions__kw_args": {
        "active": true
    }
}
Thanks, GridSearchCV! You can also build your own custom scorers for use in parameter grid searches in case you want to optimize for a particular metric (such as negative recall), but that's a subject for another time.
Now to summarize what we learned
Well, we’re finally at the end of the series! You’ve learned so much - just take a look at this list:
- How to build a basic data pipeline
- How to add text preprocessing inside a pipeline via FunctionTransformers
- How to add new feature columns using FeatureUnion and some funky FunctionTransformer stuff
- How to run a cross-validated parameter grid search on the pipeline
So there you have it: a functional, living, breathing scikit-learn pipeline to analyze sentiment. Keep building on it - add preprocessing steps, engineer new metafeatures, and keep tweaking those hyperparameters.
Hope this was helpful, and thanks for reading.
Until next time,
Ryan Cranfill
Thanks to Dylan Lingelbach, Gordon Towne, Nathaniel Meierpolys, and the rest of the crew at Earshot for all the help along the way.
This is Part 5 of 5 in a series on building a sentiment analysis pipeline using scikit-learn. You can find the introduction here. You can’t find more parts because they don’t exist.