python databricks parquet python-wheel pkg-resources

AnalysisException: Path does not exist: dbfs:/databricks/python/lib/python3.7/site-packages/sampleFolder/data;

I am packaging the following code in a whl file:

from pkg_resources import resource_filename
from pyspark.sql import DataFrame

def path_to_model(anomaly_dir_name: str, data_path: str) -> str:
    # Resolve a resource bundled in the installed package to a local filesystem path
    filepath = resource_filename(anomaly_dir_name, data_path)
    return filepath

def read_data(spark) -> DataFrame:
    return spark.read.parquet(str(path_to_model("sampleFolder", "data")))

I confirmed that the whl file contains the parquet files under the sampleFolder/data/ directory. When I run this locally it works, but when I upload the whl file to DBFS and run it, I get this error:

AnalysisException: Path does not exist: dbfs:/databricks/python/lib/python3.7/site-packages/sampleFolder/data;

I confirmed that this directory does not actually exist: dbfs:/databricks/python. Any idea what could be causing this error?

Thanks.

Solution

By default, Spark on Databricks resolves paths against DBFS unless you explicitly specify a different URI scheme. In your case, the path_to_model function returns the string /databricks/python/lib/python3.7/site-packages/sampleFolder/data, and because it has no explicit scheme, Spark uses the dbfs scheme. But the files are on the local disk of the node, not on DBFS, which is why Spark can't find them.
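To make the path resolution concrete, here is a small illustration of how a path without a scheme ends up as a dbfs: path. This is not Spark's actual code, just a sketch of the rule; resolve_scheme and default_scheme are hypothetical names used only for this example.

```python
from urllib.parse import urlparse


def resolve_scheme(path: str, default_scheme: str = "dbfs") -> str:
    """Return the path unchanged if it already has a URI scheme,
    otherwise prefix it with the default scheme (illustration only)."""
    if urlparse(path).scheme:
        return path
    return f"{default_scheme}:{path}"


local = "/databricks/python/lib/python3.7/site-packages/sampleFolder/data"
# No scheme, so the default is applied and the path is looked up on DBFS
print(resolve_scheme(local))
# An explicit file: scheme is kept as-is and points at the local disk
print(resolve_scheme("file:" + local))
```

The first result matches the path in your error message, which is why the lookup fails even though the files exist on the node's local disk.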

To fix that, copy the data onto DBFS and read it from there. This can be done with the dbutils.fs.cp command. Change the code to the following:

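def read_data(spark) -> DataFrame:
    # Local path on the node where the wheel is installed
    data_path = str(path_to_model("sampleFolder", "data"))
    # No scheme here, so this resolves to dbfs:/tmp/my_sample_data
    tmp_path = "/tmp/my_sample_data"
    # Recursively copy from the local disk (file:) to DBFS
    dbutils.fs.cp("file:" + data_path, tmp_path, True)
    return spark.read.parquet(tmp_path)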
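
One thing to watch for: dbutils is defined automatically in a notebook, but code imported from a wheel does not see that notebook global. A minimal sketch of one way around this, assuming you pass the notebook's dbutils.fs into the library function as an argument (copy_to_dbfs and its parameter names are hypothetical, and only the cp method is used):

```python
def copy_to_dbfs(fs, local_path: str, dbfs_path: str) -> str:
    """Copy a directory from the node's local disk to DBFS.

    `fs` is assumed to behave like `dbutils.fs`; it is passed in by the
    caller because the wheel's module has no `dbutils` global of its own.
    """
    # "file:" marks the source as local disk; True makes the copy recursive
    fs.cp("file:" + local_path, dbfs_path, True)
    return dbfs_path
```

From a notebook you would call it roughly as copy_to_dbfs(dbutils.fs, path_to_model("sampleFolder", "data"), "/tmp/my_sample_data") and then pass the returned path to spark.read.parquet.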