apache-spark pyspark databricks aws-glue amazon-athena

Not able to query AWS Glue/Athena views in Databricks Runtime ['java.lang.IllegalArgumentException: Can not create a Path from an empty string;']

Attempting to read a view which was created on AWS Athena (based on a Glue table that points to an S3's parquet file) using pyspark over a Databricks cluster throws the following error for an unknown reason:

java.lang.IllegalArgumentException: Can not create a Path from an empty string;

The first assumption was that access permissions are missing, but that wasn't the case.

When keep researching, I found the following Databricks' post about the reason for this issue: https://docs.databricks.com/data/metastores/aws-glue-metastore.html#accessing-tables-and-views-created-in-other-system

Solution

I was able to come up with a python script to fix the problem. It turns out that this exception occurs because Athena and Presto store view's metadata in a format that is different from what Databricks Runtime and Spark expect. You'll need to re-create your views through Spark

Python script example with execution example:

import boto3
import time


def execute_blocking_athena_query(query: str, athenaOutputPath, aws_region):
    athena = boto3.client("athena", region_name=aws_region)
    res = athena.start_query_execution(QueryString=query, ResultConfiguration={
        'OutputLocation': athenaOutputPath})
    execution_id = res["QueryExecutionId"]
    while True:
        res = athena.get_query_execution(QueryExecutionId=execution_id)
        state = res["QueryExecution"]["Status"]["State"]
        if state == "SUCCEEDED":
            return
        if state in ["FAILED", "CANCELLED"]:
            raise Exception(res["QueryExecution"]["Status"]["StateChangeReason"])
        time.sleep(1)


def create_cross_platform_view(db: str, table: str, query: str, spark_session, athenaOutputPath, aws_region):
    glue = boto3.client("glue", region_name=aws_region)
    glue.delete_table(DatabaseName=db, Name=table)
    create_view_sql = f"create view {db}.{table} as {query}"
    execute_blocking_athena_query(create_view_sql, athenaOutputPath, aws_region)
    presto_schema = glue.get_table(DatabaseName=db, Name=table)["Table"][
        "ViewOriginalText"
    ]
    glue.delete_table(DatabaseName=db, Name=table)

    spark_session.sql(create_view_sql).show()
    spark_view = glue.get_table(DatabaseName=db, Name=table)["Table"]
    for key in [
        "DatabaseName",
        "CreateTime",
        "UpdateTime",
        "CreatedBy",
        "IsRegisteredWithLakeFormation",
        "CatalogId",
    ]:
        if key in spark_view:
            del spark_view[key]
    spark_view["ViewOriginalText"] = presto_schema
    spark_view["Parameters"]["presto_view"] = "true"
    spark_view = glue.update_table(DatabaseName=db, TableInput=spark_view)


create_cross_platform_view("<YOUR DB NAME>", "<YOUR VIEW NAME>", "<YOUR VIEW SQL QUERY>", <SPARK_SESSION_OBJECT>, "<S3 BUCKET FOR OUTPUT>", "<YOUR-ATHENA-SERVICE-AWS-REGION>")

Again, note that this script keeps your views compatible with Glue/Athena.

References: