Tags: performance, nlp, spacy, dask-dataframe, scikit-learn-pipeline

Is `sklearn.Pipeline` with regex really more performant than `spacy` for preprocessing huge volumes of text?


TL;DR

I need help selecting between spacy and sklearn for processing a huge text corpus. I ran a test to measure the performance of each, but the results were unexpected. Moreover, because I'm new-ish to the frameworks involved, I lack confidence that my test is completely valid. I'd really appreciate some guidance.

Background

I'm working on a project that involves preprocessing 35 million Reddit comments. That's a massive amount of text, so I'm searching for the most efficient framework for the job.

Currently, I am considering using either spacy’s nlp.pipe with several custom components, or a sklearn.Pipeline with a ton of regex-based data transformers. Since (1) spacy is optimized for text and (2) regex in Python is slow, I figured the spacy option is the way to go. But I wanted to test my assumptions before proceeding.
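For concreteness, the sklearn side of the comparison looks roughly like this (a simplified sketch with illustrative names and patterns, not my actual transformers):

```python
import re

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class RegexRemover(BaseEstimator, TransformerMixin):
    """Stateless transformer that strips all matches of a regex pattern."""

    def __init__(self, pattern):
        self.pattern = pattern

    def fit(self, X, y=None):
        # Nothing to learn; regex transformers are stateless.
        return self

    def transform(self, X):
        regex = re.compile(self.pattern)
        return [regex.sub("", doc) for doc in X]


pipeline = Pipeline([
    ("inline_code", RegexRemover(r"`[^`]*`")),  # strip `inline code` spans
    ("punctuation", RegexRemover(r"[^\w\s]")),  # strip remaining punctuation
])

cleaned = pipeline.fit_transform(["Remove `this code` and, punctuation!"])
```

The spacy version does the same two removals inside custom pipeline components registered on `nlp.pipe`.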

The test

So I wrote a quick and dirty script to do just that. It seems like a lot of code at first skim, but it's actually not. It's very modular, mostly consisting of simple classes. Skip to if __name__ ... at the end to see the overall logic.

Anyway, this script defines what I think are broadly equivalent pipelines, one spacy-based and one sklearn-based, that simply remove (1) punctuation and (2) inline code (text wrapped in backticks, `like this`). Both pipelines subclass a shared base class that actually carries out the test. The script loads a ~7.5k-comment sample from r/LanguageTechnology as a dask.dataframe (for parallelization), applies the same preprocessing 100 times using each pipeline, then averages the results.
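The timing logic boils down to something like this (a simplified sketch; the real script runs each pipeline over a dask.dataframe, which I've omitted here):

```python
import statistics
import time


def benchmark(run_once, n_runs=100):
    """Time a zero-argument preprocessing callable over n_runs repetitions.

    Returns (mean, standard deviation) of wall-clock seconds per run.
    """
    timings = []
    for _ in range(n_runs):
        start = time.perf_counter()
        run_once()
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings), statistics.stdev(timings)
```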

To be clear, my actual pipeline will do several more things than just remove punctuation and inline code. I only chose those particular transformations for testing purposes, to keep my tests simple and to the point.

Results

My findings (in seconds) are as follows:

| pipeline | mean (s)  | std dev (s) |
|----------|-----------|-------------|
| spacy    | 13.49772  | 1.182763    |
| sklearn  | 6.853291  | 0.127701    |

Clearly, spacy was massively slower. This contradicts my expectations and leaves me unable to draw firm conclusions.

Is sklearn.Pipeline with regex truly the more efficient framework for this? Or was there an issue with my test, or with how I structured my pipelines? The latter seems plausible because almost everything the script uses is new-ish to me: dask.dataframe, spacy with custom components, and sklearn.Pipeline with custom transformers. So it may very well be that, e.g., I'm just using spacy wrong, or that something about my script renders the comparison apples to oranges instead of apples to apples.

Cry for help

In light of this uncertainty, I'd sincerely appreciate some input from anyone familiar with these frameworks. I'd also appreciate some eyes on my code, if possible, just to check that I've actually used everything properly.

Any and all input is welcome. Thank you!


Solution

  • Your regexes are faster here because they're only doing the work you need. spaCy also performs tokenization, which is unnecessary for the preprocessing you've described, so it's not surprising it's slower.

    Since it's likely you'll want tokens for whatever downstream processing you have, your current comparison may not be useful.
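As a sketch (assuming spaCy v3): if tokenization is the only analysis you need, a blank pipeline gives you the tokenizer without the tagger, parser, or NER overhead of a full pretrained model, which makes for a fairer comparison against the regex approach:

```python
import spacy

# spacy.blank() creates a tokenizer-only pipeline: no tagger, parser, or
# NER components, so nlp.pipe() does far less work than a pretrained model.
nlp = spacy.blank("en")

comments = ["Some Reddit comment.", "Another comment with `inline code`."]
docs = list(nlp.pipe(comments))
tokens = [[token.text for token in doc] for doc in docs]
```

If you do want a pretrained model's components later, you can also pass `disable=[...]` to `spacy.load` or `nlp.pipe` to switch off the ones you don't need.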