python-2.7, apache-spark, great-expectations

How to Save a Great Expectations Expectation Suite to Azure Data Lake or Blob Storage


I'm trying to save a great_expectations expectation suite to Azure ADLS Gen 2 or Blob Storage with the following line of code.

ge_df.save_expectation_suite('abfss://polybase@mipolybasestagingsbox.dfs.core.windows.net/test/newdata/loggingtableupdate.json')

However, I'm getting the following error:

FileNotFoundError: [Errno 2] No such file or directory: 'abfss://polybase@mipolybasestagingsbox.dfs.core.windows.net/test/newdata/loggingtableupdate.json'

The following is successful, but I don't know where the expectation suite is saved:

ge_df.save_expectation_suite('gregs_expectations.json')

If someone could let me know how to save to ADLS Gen 2, or where the expectation suite is saved, that would be great.


Solution

  • Great Expectations can't save to ADLS directly; it just uses the standard Python file API, which works only with local files. The last command stores the suite in the current working directory of the driver, but you can set the path explicitly, for example /tmp/gregs_expectations.json.
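
    For example, a minimal sketch (ge_df is the dataset from the question; the /tmp path is just an explicit local location on the driver):

    ge_df.save_expectation_suite('/tmp/gregs_expectations.json')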

    After saving, the second step is to upload it to ADLS. On Databricks you can use dbutils.fs.cp to copy the file onto DBFS or ADLS, as sketched below.
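
    A minimal sketch of the Databricks route, assuming the suite was saved to /tmp as above (the abfss path is the one from the question):

    # Copy the suite from the driver's local filesystem to ADLS Gen2
    dbutils.fs.cp(
        "file:/tmp/gregs_expectations.json",
        "abfss://polybase@mipolybasestagingsbox.dfs.core.windows.net/test/newdata/loggingtableupdate.json")

    If you're not running on Databricks, then you can use the azure-storage-file-datalake Python package to upload the file to ADLS (see its docs for details), something like this: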

    from azure.storage.filedatalake import DataLakeFileClient

    # Read the locally saved expectation suite as bytes
    with open('/tmp/gregs_expectations.json', 'rb') as f:
        data = f.read()

    # Create the target file in ADLS Gen2 and upload the contents
    file_client = DataLakeFileClient.from_connection_string(
        "my_connection_string",
        file_system_name="myfilesystem",
        file_path="gregs_expectations.json")
    file_client.create_file()
    file_client.append_data(data, offset=0, length=len(data))
    file_client.flush_data(len(data))
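
    Newer releases of the SDK also provide a one-shot helper that combines the create, append, and flush steps; a sketch, assuming a recent azure-storage-file-datalake version:

    # Create (or overwrite) the file and upload the contents in a single call
    file_client.upload_data(data, overwrite=True)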