I need help selecting between `spacy` and `sklearn` for processing a huge text corpus. I ran a test to measure the performance of each, but the results were unexpected. Moreover, because I'm new-ish to the frameworks involved, I lack confidence that my test is completely valid. I'd really appreciate some guidance.
I'm working on a project that involves preprocessing 35 million Reddit comments. That is a massive amount of text, so I'm looking for the most efficient framework for the job.
Currently, I am considering using either `spacy`'s `nlp.pipe` with several custom components, or a `sklearn.Pipeline` with a ton of regex-based data transformers. Since (1) `spacy` is optimized for text and (2) regex in Python is slow, I figured the `spacy` option is the way to go. But I wanted to test my assumptions before proceeding.
So I wrote a quick and dirty script to do just that. It seems like a lot of code at first skim, but it's actually not. It's very modular, mostly consisting of simple classes. Skip to `if __name__ ...` at the end to see the overall logic.
Anyway, this script defines what I think are broadly equivalent pipelines, one `spacy`-based and one `sklearn`-based, that simply remove (1) punctuation and (2) inline code `like this`. These pipelines subclass an additional class which actually carries out the test. So the script loads a ~7.5k-comment sample from r/LanguageTechnology as a `dask.dataframe` (for parallelization), applies the same preprocessing 100 times using each pipeline, then averages out the results.
To be clear, my actual pipeline will do several more things than just remove punctuation and inline code. I only chose those particular transformations for testing purposes, to keep my tests simple and to the point.
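As a rough illustration of the two test transformations, here is a minimal standard-library sketch of the regex approach. The patterns and the names `strip_inline_code`, `strip_punctuation`, and `preprocess` are assumptions made for this example, not the code from the actual script, and the two hard-coded comments stand in for the real sample.

```python
import re
import statistics
import string
import timeit

# Assumed patterns: backtick-delimited inline code and ASCII punctuation.
INLINE_CODE = re.compile(r"`[^`\n]+`")
PUNCTUATION = re.compile(f"[{re.escape(string.punctuation)}]")


def strip_inline_code(text: str) -> str:
    """Replace each `inline code` span with a space."""
    return INLINE_CODE.sub(" ", text)


def strip_punctuation(text: str) -> str:
    """Delete every ASCII punctuation character."""
    return PUNCTUATION.sub("", text)


def preprocess(text: str) -> str:
    """Apply both steps, then collapse runs of whitespace."""
    text = strip_punctuation(strip_inline_code(text))
    return " ".join(text.split())


# Hypothetical stand-in for the comment sample.
comments = [
    "Use `nlp.pipe` for batches, not a loop!",
    "Is regex really slow?",
]
cleaned = [preprocess(c) for c in comments]

# Time repeated runs and summarise them as mean and standard deviation.
runs = timeit.repeat(lambda: [preprocess(c) for c in comments], repeat=5, number=100)
print(f"mean={statistics.mean(runs):.6f}s stdev={statistics.stdev(runs):.6f}s")
```

Inline code is removed before punctuation so that the backticks are still present when the first pattern runs.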
My findings (in seconds) are as follows, illustrated graphically here:
pipeline | mean | standard dev |
---|---|---|
`spacy` | 13.49772 | 1.182763 |
`sklearn` | 6.853291 | 0.127701 |
Clearly, `spacy` was much slower, taking roughly twice as long on average. This contradicted my expectations, and leaves me unable to draw firm conclusions.
Is `sklearn.Pipeline` with regex truly the more efficient framework for this? Or was there an issue with my test, or how I structured my pipelines? The latter seems plausible because almost everything the script uses is new-ish to me - `dask.dataframe`, `spacy` with custom components, and `sklearn.Pipeline` with custom transformers. So it may very well be that e.g., I'm just using `spacy` wrong, or there's something about my script that renders the comparison apples to oranges instead of apples to apples.
In light of this uncertainty, I'd sincerely appreciate some input from anyone familiar with these frameworks. I'd also appreciate some eyes on my code, if possible, just to check that I've actually used everything properly.
Any and all input is welcome. Thank you!
Your regexes are faster here because they're only doing the work you need. spaCy also tokenizes the text, which the preprocessing described here doesn't require, so it's not surprising that it's slower.
Since it's likely you'll want tokens for whatever downstream processing you have, your current comparison may not be useful.
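To make the point concrete, here is a small standard-library sketch that contrasts substitution-only cleaning with a tokenize-then-filter step. The regex tokenizer is a deliberately crude stand-in and is much simpler than spaCy's tokenizer. The function names and the sample text are assumptions for illustration only.

```python
import re
import timeit

# Crude stand-in tokenizer (an assumption, not spaCy's rules):
# a token is a run of word characters or one punctuation character.
TOKEN = re.compile(r"\w+|[^\w\s]")
PUNCT = re.compile(r"[^\w\s]")


def clean_only(text: str) -> str:
    """Substitution only: returns a cleaned string, no tokens."""
    return PUNCT.sub("", text)


def tokenize_then_filter(text: str) -> list[str]:
    """Tokenize first, then drop punctuation tokens: returns a token list."""
    return [tok for tok in TOKEN.findall(text) if not PUNCT.fullmatch(tok)]


sample = "Well, is spaCy slower? Probably!"
corpus = [sample] * 1000

# Time both variants on the same input; the numbers depend on the machine.
t_clean = timeit.timeit(lambda: [clean_only(t) for t in corpus], number=10)
t_tokens = timeit.timeit(lambda: [tokenize_then_filter(t) for t in corpus], number=10)
print(f"clean only: {t_clean:.4f}s, tokenize + filter: {t_tokens:.4f}s")
```

The two functions produce different kinds of output, a string versus a list of tokens. If the downstream steps need tokens, a like-for-like benchmark would also charge the regex pipeline for producing them.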