I'm processing 40,000 abstracts with spaCy's nlp.pipe using the code below, and it's taking 8 minutes. Is there a way to speed this up further? I've already disabled NER.
import spacy

nlp = spacy.load("en_core_web_md", disable=["ner"])

def process_abstract(df):
    cleaned_text = []
    document = list(nlp.pipe(df['abstract'].values))
    for doc in document:
        # Keep only alphabetic tokens that are not punctuation, stop words, or number-like
        text = [token.text for token in doc
                if token.is_punct==False and
                token.is_stop==False and
                token.like_num==False and
                token.is_alpha==True
                ]
        cleaned_text.append(' '.join(text).lower())
    return cleaned_text
Try tuning the batch_size and n_process parameters of nlp.pipe:
def process_abstract(df):
    cleaned_text = []
    document = nlp.pipe(df["abstract"].to_list(), batch_size=256, n_process=12)
    for doc in document:
        text = [
            token.text
            for token in doc
            if not token.is_punct
            and not token.is_stop
            and not token.like_num
            and token.is_alpha
        ]
        cleaned_text.append(" ".join(text).lower())
    return cleaned_text
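The value n_process=12 only helps if the machine actually has that many cores to spare. As a rough sketch, a small helper can cap the requested worker count at the available core count. The name choose_n_process and the reserve parameter are hypothetical, not part of spaCy, and the best value is still something to measure on your own data.

```python
import os


def choose_n_process(requested: int, reserve: int = 1) -> int:
    """Cap a requested worker count at the cores available, keeping `reserve` free.

    Hypothetical helper for illustration; always returns at least 1.
    """
    available = os.cpu_count() or 1  # os.cpu_count() can return None
    return max(1, min(requested, available - reserve))


# Example: ask for 12 workers, but never more than the machine can offer.
n_process = choose_n_process(12)
print(n_process)
```

The result can then be passed as n_process to nlp.pipe in place of the hard-coded 12.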
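Note also that joining on " " may give you some surprises, since spaCy's tokenization rules are more complex than splitting on whitespace: the tokenizer can split a single whitespace-delimited word into several tokens, so the joined string will not necessarily match the original spacing.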
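To illustrate the spacing issue, here is a minimal sketch that compares joining tokens on " " with rebuilding the text from token.text_with_ws. It assumes a blank English pipeline (tokenizer only) so that no model download is needed; the sample sentence is made up for the example.

```python
import spacy

# Blank English pipeline: tokenizer only, no trained components.
nlp = spacy.blank("en")
doc = nlp("It's a well-known method.")

# Joining on a single space inserts spaces that were not in the original text.
joined = " ".join(token.text for token in doc)

# text_with_ws keeps each token's original trailing whitespace (if any),
# so concatenating it reproduces the input exactly.
rebuilt = "".join(token.text_with_ws for token in doc)

print(joined)
print(rebuilt)
```

For the filtered output in the answer this may not matter, since the removed tokens leave gaps anyway, but it is worth knowing if you ever need the cleaned text to line up with the original.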