apache-spark azure-databricks

Can't import installed Python modules in a Spark cluster offered by Azure Databricks


I have just begun running Python notebooks on a Spark cluster offered in Azure Databricks. As a requirement, we installed a couple of external packages, such as spacy and kafka, through both shell commands and the 'Create Library' UI in the Databricks workspace.

python -m spacy download en_core_web_sm

However, every time we run 'import ', the cluster throws a 'Module not found' error.

OSError: Can't find Model 'en_core_web_sm'

On top of that, we cannot find any way to tell exactly where these modules are being installed. The issue persists even after adding the module path to 'sys.path'.

Please let us know how to fix this as soon as possible.


Solution

  • You can follow the steps below to install and load the spaCy package on Azure Databricks.

    Step 1: Install spaCy using pip and download the spaCy model.

    %sh
    /databricks/python3/bin/pip install spacy 
    /databricks/python3/bin/python3 -m spacy download en_core_web_sm
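
The absolute `/databricks/python3/bin` path above matters because `pip` and `python` on the shell's PATH are not guaranteed to be the interpreter the notebook uses. As a minimal sketch, the same commands can be derived from `sys.executable` inside a Python cell instead of hard-coding the path. The helper names `pip_install_command` and `model_download_command` are hypothetical, and the snippet only prints the commands rather than running them.

```python
import sys

def pip_install_command(package):
    """Build a pip command bound to the interpreter running this notebook."""
    return [sys.executable, "-m", "pip", "install", package]

def model_download_command(model):
    """Build the spaCy model download command for the same interpreter."""
    return [sys.executable, "-m", "spacy", "download", model]

print("Notebook interpreter:", sys.executable)
print(" ".join(pip_install_command("spacy")))
print(" ".join(model_download_command("en_core_web_sm")))

# To actually run a command from Python (left commented out on purpose):
# import subprocess
# subprocess.run(pip_install_command("spacy"), check=True)
```

If the printed interpreter path differs from the one used in the `%sh` cell, that mismatch is a likely reason the import fails.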
    


    Step 2: Run the example using spaCy.

    import spacy
    
    # Load English tokenizer, tagger, parser, NER and word vectors
    nlp = spacy.load("en_core_web_sm")
    
    # Process whole documents
    text = ("When Sebastian Thrun started working on self-driving cars at "
            "Google in 2007, few people outside of the company took him "
            "seriously. “I can tell you very senior CEOs of major American "
            "car companies would shake my hand and turn away because I wasn’t "
            "worth talking to,” said Thrun, in an interview with Recode earlier "
            "this week.")
    doc = nlp(text)
    
    # Analyze syntax
    print("Noun phrases:", [chunk.text for chunk in doc.noun_chunks])
    print("Verbs:", [token.lemma_ for token in doc if token.pos_ == "VERB"])
    
    # Find named entities, phrases and concepts
    for entity in doc.ents:
        print(entity.text, entity.label_)
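
If `spacy.load` still fails, it can help to check what the notebook's interpreter can actually see before importing. The sketch below uses only the standard library's `importlib.util.find_spec`; `locate_module` and `missing_packages` are hypothetical helper names, and the check assumes the model was installed as a regular Python package, which is how `spacy download` installs it.

```python
import importlib.util
import sys

def locate_module(name):
    """Return where a top-level module would be loaded from, or None if absent."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    # Namespace packages have no single origin file, so fall back to their paths.
    return spec.origin or str(list(spec.submodule_search_locations or []))

def missing_packages(names):
    """Return the subset of names that this interpreter cannot find."""
    return [name for name in names if locate_module(name) is None]

print("Interpreter:", sys.executable)
for name in ("spacy", "en_core_web_sm"):
    print(name, "->", locate_module(name) or "not found")
print("Missing:", missing_packages(["spacy", "en_core_web_sm"]))
```

A package listed as missing here was installed somewhere this interpreter does not search, which also answers the question of where the modules ended up: the printed paths show the location of each one that is found.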
    


    Hope this helps. Do let us know if you have any further queries.


    Do click "Mark as Answer" and upvote the post that helps you; this can be beneficial to other community members.