Search code examples
apache-sparkpysparkdatabricksaws-glueamazon-athena

Not able to query AWS Glue/Athena views in Databricks Runtime ['java.lang.IllegalArgumentException: Can not create a Path from an empty string;']


Attempting to read a view which was created on AWS Athena (based on a Glue table that points to an S3's parquet file) using pyspark over a Databricks cluster throws the following error for an unknown reason:

java.lang.IllegalArgumentException: Can not create a Path from an empty string;

The first assumption was that access permissions are missing, but that wasn't the case.

When keep researching, I found the following Databricks' post about the reason for this issue: https://docs.databricks.com/data/metastores/aws-glue-metastore.html#accessing-tables-and-views-created-in-other-system


Solution

  • I was able to come up with a python script to fix the problem. It turns out that this exception occurs because Athena and Presto store view's metadata in a format that is different from what Databricks Runtime and Spark expect. You'll need to re-create your views through Spark

    Python script example with execution example:

    import boto3
    import time
    
    
    def execute_blocking_athena_query(query: str, athenaOutputPath, aws_region):
        athena = boto3.client("athena", region_name=aws_region)
        res = athena.start_query_execution(QueryString=query, ResultConfiguration={
            'OutputLocation': athenaOutputPath})
        execution_id = res["QueryExecutionId"]
        while True:
            res = athena.get_query_execution(QueryExecutionId=execution_id)
            state = res["QueryExecution"]["Status"]["State"]
            if state == "SUCCEEDED":
                return
            if state in ["FAILED", "CANCELLED"]:
                raise Exception(res["QueryExecution"]["Status"]["StateChangeReason"])
            time.sleep(1)
    
    
    def create_cross_platform_view(db: str, table: str, query: str, spark_session, athenaOutputPath, aws_region):
        glue = boto3.client("glue", region_name=aws_region)
        glue.delete_table(DatabaseName=db, Name=table)
        create_view_sql = f"create view {db}.{table} as {query}"
        execute_blocking_athena_query(create_view_sql, athenaOutputPath, aws_region)
        presto_schema = glue.get_table(DatabaseName=db, Name=table)["Table"][
            "ViewOriginalText"
        ]
        glue.delete_table(DatabaseName=db, Name=table)
    
        spark_session.sql(create_view_sql).show()
        spark_view = glue.get_table(DatabaseName=db, Name=table)["Table"]
        for key in [
            "DatabaseName",
            "CreateTime",
            "UpdateTime",
            "CreatedBy",
            "IsRegisteredWithLakeFormation",
            "CatalogId",
        ]:
            if key in spark_view:
                del spark_view[key]
        spark_view["ViewOriginalText"] = presto_schema
        spark_view["Parameters"]["presto_view"] = "true"
        spark_view = glue.update_table(DatabaseName=db, TableInput=spark_view)
    
    
    create_cross_platform_view("<YOUR DB NAME>", "<YOUR VIEW NAME>", "<YOUR VIEW SQL QUERY>", <SPARK_SESSION_OBJECT>, "<S3 BUCKET FOR OUTPUT>", "<YOUR-ATHENA-SERVICE-AWS-REGION>")
    

    Again, note that this script keeps your views compatible with Glue/Athena.

    References: