apache-spark · pyspark · databricks

Including a Period between Variables in Apache Spark with Databricks


It would appear Databricks now requires you to include the catalog name when creating a Delta table (well, at least in the way that I want to create one).

Without going through the whole code, the following code will create a database and Delta table called 'baseapm.securelogin' in traditional Databricks with the hive_metastore:

from delta.tables import DeltaTable

stageName = 'base'
regName = 'apm.securelogin'

deltadf = DeltaTable.forName(spark, f"{stageName}{regName}")

Combining those two variables will output: baseapm.securelogin


Databricks appears to have made some changes, and it seems you now have to create Delta tables inside a catalog, so I need to modify the code above to include the catalog name.

My Catalog Name is called liveend2end.

Therefore, I need to modify my code above such that it outputs the following: liveend2end.baseapm.securelogin

I tried the following:

catName = 'liveend2end'
deltadf = DeltaTable.forName(spark,f"{catName}{stageName}{regName}")

But all I got was liveend2endbaseapm.securelogin

However, the output I need is: liveend2end.baseapm.securelogin

Could someone please show me how to modify the code so that I get liveend2end.baseapm.securelogin?

I also tried the following to add a period after catName:

deltadf = DeltaTable.forName(spark,f"{catName}'.'{stageName}{regName}")

But I got the error:

ParseException: [PARSE_SYNTAX_ERROR] Syntax error at or near ''.''.

I have been shown how to add a period like this in the past, and I remember it was quite simple, but I just can't remember how to place the period at the moment.


Solution

  • One tip is to remember that catalogs, schemas, and tables should never have periods in their names; the period is the delimiter that informs the Spark catalog how to parse the fully qualified table name: <catalog>.<schema>.<table>. Naming your variables in a way that aligns 1-to-1 with each of those 3 layers may help reduce confusion.

    In the OP's question, these can be correctly identified as:

    • catalog = liveend2end
    • schema = baseapm
    • table = securelogin

    So the desired fully qualified table name, or expected output, should be liveend2end.baseapm.securelogin.

    Now.. solutions:

    (TL;DR) Option 1: String Concatenation

    Using the OP's existing approach, which was only slightly off.

    catName = 'liveend2end'
    stageName = 'base'
    regName = 'apm.securelogin'
    
    table = f"{catName}.{stageName}{regName}"
    deltadf = DeltaTable.forName(spark, table)
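
    As a quick aside on the earlier ParseException: in an f-string, anything outside the {} braces is emitted literally, so writing '.' between the braces inserts the quote characters into the table name along with the period. A small check (reusing the variables above) makes this visible:

    print(f"{catName}'.'{stageName}{regName}")   # liveend2end'.'baseapm.securelogin -> invalid identifier
    print(f"{catName}.{stageName}{regName}")     # liveend2end.baseapm.securelogin   -> what we want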
    

    Option 2: USE CATALOG ...

    Similar, but this sets the Spark Session to use the specified catalog for any subsequent queries. This may be simpler if you want to minimize code changes.

    catName = 'liveend2end'
    stageName = 'base'
    regName = 'apm.securelogin'
    
    # Use Catalog
    spark.sql(f"USE CATALOG {catName}")
    
    # Remaining code is not changed:
    table = f"{stageName}{regName}"
    deltadf = DeltaTable.forName(spark, table)
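
    If you want to confirm that the session is now pointed at the right catalog, a quick sanity check (assuming a recent runtime, Spark 3.4+ / current DBR) could look like this:

    # Check which catalog the session is currently using
    print(spark.catalog.currentCatalog())        # expected: liveend2end
    spark.sql("SELECT current_catalog()").show() # same check via SQL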
    

    Option 3: Spark Conf to set default catalog

    Similar to option #2 but this requires zero code change and instead sets the default catalog using a Spark configuration which you would add to your cluster.

    # Cluster Spark config (set on the cluster, not in the notebook):
    # spark.databricks.sql.initial.catalog.name=liveend2end
    
    # No code changes needed:
    stageName = 'base'
    regName = 'apm.securelogin'
    
    table = f"{stageName}{regName}"
    deltadf = DeltaTable.forName(spark, table)
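
    If editing the cluster config isn't convenient, a session-level variant (again assuming Spark 3.4+ / a recent DBR) is to set the current catalog from Python; the effect is essentially the same as Option 2's USE CATALOG:

    # Set the session's current catalog programmatically
    spark.catalog.setCurrentCatalog('liveend2end')
    
    # Remaining code is unchanged
    table = f"{stageName}{regName}"
    deltadf = DeltaTable.forName(spark, table)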
    

    Note: while it's highly recommended to upgrade to catalogs in Unity Catalog, hive_metastore is still a valid type of catalog. All the above options can be used with catName = 'hive_metastore' as well.
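
    For example, the same three-part naming also works against the legacy metastore (a minimal sketch, assuming the original schema and table still live in hive_metastore):

    catName = 'hive_metastore'
    stageName = 'base'
    regName = 'apm.securelogin'
    
    table = f"{catName}.{stageName}{regName}"    # hive_metastore.baseapm.securelogin
    deltadf = DeltaTable.forName(spark, table)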