
Reading / Extracting Data from Databricks Database (hive_metastore) with PySpark


I am trying to read in data from the Databricks hive_metastore with PySpark. In the screenshot below, I am trying to read in the table called 'trips', which is located in the database nyctaxi.

Typically, if this table were located on an Azure SQL server, I would use code like the following:

df = spark.read.format("jdbc")\
    .option("url", jdbcUrl)\
    .option("dbtable", tableName)\
    .load()

Or if the table were in ADLS (Azure Data Lake Storage), I would use code similar to the following:

df = spark.read.csv("adl://mylake.azuredatalakestore.net/tableName.csv", header=True)

Can someone let me know how I would read in the table below using PySpark from the Databricks database:

[screenshot]

This additional screenshot may also help:

[screenshot]

OK, I've just realized that I think I should be asking how to read tables from the "samples" metastore.

In any case I would like help reading in the "trips" table from the nyctaxi database please.


Solution

  • The samples catalog can be accessed using spark.table("catalog.schema.table").

    So you should be able to access the table using:

    df = spark.table("samples.nyctaxi.trips")
    

    Note also that if you are working directly in Databricks notebooks, the Spark session is already available as spark, so there is no need to call getOrCreate().
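
    To illustrate how spark.table() resolves a registered table name, here is a minimal local sketch. It builds its own SparkSession (which you would not do on Databricks) and registers a small hypothetical "trips" DataFrame as a temp view standing in for samples.nyctaxi.trips, since the real samples catalog is only available inside a Databricks workspace:

    ```python
    from pyspark.sql import SparkSession

    # Local session for illustration only; on Databricks, `spark` already exists.
    spark = (
        SparkSession.builder
        .master("local[1]")
        .appName("trips-demo")
        .getOrCreate()
    )

    # Hypothetical mini "trips" data standing in for samples.nyctaxi.trips.
    trips = spark.createDataFrame(
        [(1.1, 10.0), (2.5, 25.0)],
        ["trip_distance", "fare_amount"],
    )
    trips.createOrReplaceTempView("trips")

    # Resolve the table by name, exactly as you would with
    # spark.table("samples.nyctaxi.trips") on Databricks.
    df = spark.table("trips")
    df.show()
    ```

    The equivalent SQL form, spark.sql("SELECT * FROM samples.nyctaxi.trips"), returns the same DataFrame on Databricks.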