I have just begun running Python notebooks on a Spark cluster in Azure Databricks. As a requirement, we installed a couple of external packages, such as spacy and kafka, both through shell commands and through the 'Create Library' UI in the Databricks workspace.
python -m spacy download en_core_web_sm
However, every time we run the import, the cluster throws a 'Module not found' error:
OSError: Can't find Model 'en_core_web_sm'
On top of that, we can find no way to determine exactly where these modules are being installed. The issue persists even after adding the module path to 'sys.path'.
Please let us know how to fix this as soon as possible.
You can follow the steps below to install and load the spaCy package on Azure Databricks.
Step 1: Install spaCy using pip and download the spaCy model.
%sh
/databricks/python3/bin/pip install spacy
/databricks/python3/bin/python3 -m spacy download en_core_web_sm
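If you also want to confirm where a package was installed (one of the questions above), you can ask the interpreter itself from a Python cell. This is a minimal stdlib-only sketch using `importlib.util.find_spec`; the package name is just an example and any name works:

```python
import importlib.util
import sys

# Show which Python interpreter the notebook cell is using; if this
# differs from the interpreter pip installed into, imports will fail.
print("Interpreter:", sys.executable)

# find_spec returns None when the package is not importable from this
# interpreter; otherwise spec.origin is the file it would be loaded from.
spec = importlib.util.find_spec("spacy")  # any package name works here
if spec is None:
    print("spacy is not visible to this interpreter")
else:
    print("spacy is installed at:", spec.origin)
```

Running the same check for `en_core_web_sm` tells you whether the model package landed in the same environment as spaCy.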
Step 2: Run an example using spaCy.
import spacy

# Load English tokenizer, tagger, parser, NER and word vectors
nlp = spacy.load("en_core_web_sm")

# Process whole documents
text = ("When Sebastian Thrun started working on self-driving cars at "
        "Google in 2007, few people outside of the company took him "
        "seriously. “I can tell you very senior CEOs of major American "
        "car companies would shake my hand and turn away because I wasn’t "
        "worth talking to,” said Thrun, in an interview with Recode earlier "
        "this week.")
doc = nlp(text)

# Analyze syntax
print("Noun phrases:", [chunk.text for chunk in doc.noun_chunks])
print("Verbs:", [token.lemma_ for token in doc if token.pos_ == "VERB"])

# Find named entities, phrases and concepts
for entity in doc.ents:
    print(entity.text, entity.label_)
Hope this helps. Do let us know if you have any further queries.
Do click on "Mark as Answer" and Upvote the post that helps you; this can be beneficial to other community members.