Can't import installed Python modules in Spark cluster offered by Azure Databricks

I have just begun running Python notebooks on the Spark cluster offered by Azure Databricks. As a requirement, we have installed a couple of external packages, such as spacy and kafka, through both shell commands and the 'Create library' UI in the Databricks workspace.

python -m spacy download en_core_web_sm

However, every time we run 'import ', the cluster throws a 'Module not found' error.

OSError: Can't find Model 'en_core_web_sm'

On top of that, we can find no way to tell exactly where these modules are being installed. The issue persists even after adding the module path to 'sys.path'.

  • You can follow the steps below to install and load the spaCy package on Azure Databricks.

    Step 1: Install spaCy using pip and download the spaCy model.

    /databricks/python3/bin/pip install spacy 
    /databricks/python3/bin/python3 -m spacy download en_core_web_sm
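If the import still fails after installing, one possible cause is that the package went into a different Python environment than the one the notebook runs on. The sketch below is a minimal diagnostic, not part of the original steps: it prints the interpreter the notebook is using and the file each module would be imported from. The helper name `locate_module` is hypothetical; it only relies on the standard-library `importlib.util.find_spec`.

```python
import importlib.util
import sys


def locate_module(name):
    """Return the path a module would be imported from, or None if not found."""
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        # Raised for some malformed or partially initialised modules.
        return None
    if spec is None:
        return None
    return spec.origin


# The interpreter the notebook is actually running on. Packages have to be
# installed into this interpreter's environment to be importable.
print("Interpreter:", sys.executable)

# Module names taken from the question; a value of None means "not visible here".
for module_name in ("spacy", "en_core_web_sm"):
    print(module_name, "->", locate_module(module_name))
```

If the printed interpreter differs from the one used for the pip command, or a module shows up as `None`, the install likely targeted another environment.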

    Step 2: Run the example using spaCy.

    import spacy
    # Load English tokenizer, tagger, parser, NER and word vectors
    nlp = spacy.load("en_core_web_sm")
    # Process whole documents
    text = ("When Sebastian Thrun started working on self-driving cars at "
            "Google in 2007, few people outside of the company took him "
            "seriously. “I can tell you very senior CEOs of major American "
            "car companies would shake my hand and turn away because I wasn’t "
            "worth talking to,” said Thrun, in an interview with Recode earlier "
            "this week.")
    doc = nlp(text)
    # Analyze syntax
    print("Noun phrases:", [chunk.text for chunk in doc.noun_chunks])
    print("Verbs:", [token.lemma_ for token in doc if token.pos_ == "VERB"])
    # Find named entities, phrases and concepts
    for entity in doc.ents:
        print(entity.text, entity.label_)
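The `OSError` from the question is raised when `spacy.load` cannot find the model. Models fetched with `python -m spacy download` are installed as regular Python packages, so it is possible to check for the model before loading it. The following is a minimal sketch under that assumption; the helper name `load_spacy_model` is hypothetical and not part of spaCy.

```python
import importlib.util


def load_spacy_model(model_name="en_core_web_sm"):
    """Load a spaCy model, or return None if its package is not importable."""
    # The model is an ordinary Python package, so find_spec tells us whether
    # the current interpreter can see it.
    if importlib.util.find_spec(model_name) is None:
        print(f"Model package '{model_name}' is not visible to this interpreter.")
        return None
    import spacy  # imported lazily so the check above works without spaCy
    return spacy.load(model_name)


nlp = load_spacy_model()
if nlp is not None:
    doc = nlp("Sebastian Thrun worked on self-driving cars at Google.")
    print([(ent.text, ent.label_) for ent in doc.ents])
```

Returning `None` instead of raising keeps the notebook cell from failing outright and makes it clear that the model, rather than spaCy itself, is what is missing.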

    Hope this helps. Do let us know if you have any further queries.

