Tags: mongodb, pyspark, databricks, azure-cosmosdb-mongoapi

Connecting Azure Databricks to a Cosmos DB MongoDB API database


I am trying to connect a Python notebook in an Azure Databricks cluster to a Cosmos DB MongoDB API database.

I'm using the mongo connector 2.11:2.4.2 with Python 3.

My code is as follows:

ReadConfig = {
  "Endpoint" : "https://<my_name>.mongo.cosmos.azure.com:443/",
  "Masterkey" : "<my_key>",
  "Database" : "database",
  "preferredRegions" : "West US 2",
  "Collection": "collection1",
  "schema_samplesize" : "1000",
  "query_pagesize" : "200000",
  "query_custom" : "SELECT * FROM c"
}



df = spark.read.format("mongo").options(**ReadConfig).load()
df.createOrReplaceTempView("dfSQL")

The error I get is: Could not initialize class com.mongodb.spark.config.ReadConfig$.

How can I work this out?


Solution

  • Answer to my own question.

    Using Maven as the source, I installed the correct library on my cluster using the coordinates

    org.mongodb.spark:mongo-spark-connector_2.11:2.4.0

    (for Spark 2.4 / Scala 2.11)
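
    The "URI" value in the configuration below is the MongoDB connection string of the Cosmos DB account, not the https:// endpoint used in the question. As a hedged sketch, it usually has the following shape (the account name and key are placeholders; copy the exact string from the Connection String blade of the Cosmos DB account in the Azure portal):

    # Hypothetical shape of a Cosmos DB "MongoDB API" connection string --
    # the real value should be copied from the Azure portal.
    uri = "mongodb://<account>:<primary_key>@<account>.mongo.cosmos.azure.com:10255/?ssl=true&replicaSet=globaldb"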

    An example of the code I used is as follows (for those who want to try it):

    # Read Configuration
    readConfig = {
        "URI": "<URI>",
        "Database": "<database>",
        "Collection": "<collection>",
        "ReadingBatchSize": "<batchSize>"
    }
    
    
    # Aggregation pipeline applied when reading (sorts by account_contact)
    pipelineAccounts = "{'$sort' : {'account_contact': 1}}"
    
    # Connect via the MongoDB Spark connector to create a Spark DataFrame
    accountsTest = (spark.read.
                     format("com.mongodb.spark.sql").
                     options(**readConfig).
                     option("pipeline", pipelineAccounts).
                     load())
    
    accountsTest.select("account_id").show()
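
    For completeness, the resulting DataFrame can also be registered as a temporary view and queried with Spark SQL, as intended in the original question; a minimal sketch (the view name "accounts" is arbitrary):

    # Register the DataFrame as a temporary view and query it with Spark SQL
    accountsTest.createOrReplaceTempView("accounts")
    spark.sql("SELECT account_id FROM accounts").show()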