Adding functions that execute in series with a pipeline is useful, but what if you want to create a new feature with a function?
This is Part 4 of 5 in a series on building a sentiment analysis pipeline using scikit-learn. You can find Part 5 here, and the introduction here.
Jump to:
- Part 1 - Introduction and requirements
- Part 2 - Building a basic pipeline
- Part 3 - Adding a custom function to a pipeline
- Part 5 - Hyperparameter tuning in pipelines with GridSearchCV
Part 4 - Adding a custom feature to a pipeline with FeatureUnion
Let’s learn how to add the output of a function as an additional feature for our classifier.
Setup
In [1]:
got 92 posts from page 1...
got 88 posts from page 2...
got 88 posts from page 3...
got 91 posts from page 4...
got 87 posts from page 5...
got 89 posts from page 6...
got 95 posts from page 7...
got 93 posts from page 8...
got 86 posts from page 9...
got 90 posts from page 10...
got all pages - 899 posts in total
CPU times: user 710 ms, sys: 193 ms, total: 903 ms
Wall time: 6.24 s
The function
For this example, we’ll append the length of the post to the output of the count vectorizer, the thinking being that longer posts could be more likely to be polarized (such as someone going on a rant).
In [2]:
Adding new features
scikit-learn has a nice FeatureUnion class that enables you to essentially concatenate more feature columns to the output of the count vectorizer. This is useful for adding “meta” features.
It’s pretty silly, but to add a feature in a FeatureUnion
, it has to come back
as a numpy array of dim(rows, num_cols)
. For our purposes in this example,
we’re only bringing back a single column, so we have to reshape the output to
dim(rows, 1)
. Gotta love it. So first, we’ll define a method to reshape the
output of a function into something acceptable for FeatureUnion
. After that,
we’ll build our function that will wrap a function to be easily pipeline-able.
In [3]:
Adding the function and testing
In [4]:
null accuracy: 38.67%
accuracy score: 62.22%
model is 23.56% more accurate than null accuracy
---------------------------------------------------------------------------
Confusion Matrix
Predicted negative neutral positive __all__
Actual
negative 28 19 17 64
neutral 10 44 20 74
positive 5 14 68 87
__all__ 43 77 105 225
---------------------------------------------------------------------------
Classification Report
precision recall F1_score support
Classes
negative 0.651163 0.4375 0.523364 64
neutral 0.571429 0.594595 0.582781 74
positive 0.647619 0.781609 0.708333 87
__avg / total__ 0.623569 0.622222 0.614427 225
Almost done
Even though we did it in kind of a weird way, we are now able to add arbitrary functions as new feature columns!
We’re now ready for the last part of the series - doing a parameter grid search on the pipeline. Come on, let’s do it!
This is Part 4 of 5 in a series on building a sentiment analysis pipeline using scikit-learn. You can find Part 5 here, and the introduction here.
Jump to: