python nlp spacy

Processing text with spaCy nlp.pipe


I'm processing 40,000 abstracts with spaCy's nlp.pipe using the code below, and it's taking 8 minutes. Is there a way to speed this up further? I've already disabled ner.

import spacy

nlp = spacy.load("en_core_web_md", disable=["ner"])

def process_abstract(df):
    cleaned_text = []
    document = list(nlp.pipe(df['abstract'].values))
    for doc in document:
        text = [token.text for token in doc 
                if token.is_punct==False and 
                token.is_stop==False and 
                token.like_num==False and 
                token.is_alpha==True
                ]
        cleaned_text.append(' '.join(text).lower())
    return cleaned_text

Solution

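  • Try tuning the batch_size and n_process parameters of nlp.pipe (batch_size is the number of texts buffered per batch, n_process the number of worker processes):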

    def process_abstract(df):
        cleaned_text = []
        document = nlp.pipe(df["abstract"].to_list(), batch_size=256, n_process=12)
        for doc in document:
            text = [
                token.text
                for token in doc
                if not token.is_punct
                and not token.is_stop
                and not token.like_num
                and token.is_alpha
            ]
            cleaned_text.append(" ".join(text).lower())
        return cleaned_text
    

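    Note also that joining on " " may give you some surprises: spaCy's tokenization rules are more complex than splitting on whitespace, so the joined tokens will not always match the original words (a hyphenated term, for example, is split into several tokens).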
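
The n_process=12 value above assumes a machine with at least that many cores. As a minimal sketch, the worker count could instead be derived from the machine it runs on. The helper name clean_abstracts, its defaults, and the "leave one core free" rule are assumptions made for illustration; they are not part of spaCy or of the original answer. The filter is the same as in the code above.

```python
import os


def clean_abstracts(nlp, texts, batch_size=256, n_process=None):
    """Return one lower-cased, filtered string per input text.

    `nlp` is an already loaded spaCy pipeline. The function name and
    its defaults are hypothetical, chosen only for this sketch.
    """
    if n_process is None:
        # Assumption: leave one core free for the main process.
        # os.cpu_count() can return None, hence the fallback to 1.
        n_process = max(1, (os.cpu_count() or 1) - 1)

    cleaned = []
    # nlp.pipe yields Doc objects lazily, so the docs are never
    # all held in a list at the same time.
    for doc in nlp.pipe(texts, batch_size=batch_size, n_process=n_process):
        words = [
            token.text.lower()
            for token in doc
            if not token.is_punct
            and not token.is_stop
            and not token.like_num
            and token.is_alpha
        ]
        cleaned.append(" ".join(words))
    return cleaned
```

With the pipeline from the question it would be called as clean_abstracts(nlp, df["abstract"].to_list()). On platforms that start worker processes with spawn (Windows, macOS), a call with n_process greater than 1 generally needs to sit under an if __name__ == "__main__": guard. The best batch_size and n_process depend on text length and hardware, so it is worth timing a few combinations on a sample of the abstracts.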