Tags: python, azure-machine-learning-service, azure-data-lake-gen2

How to connect AMLS to ADLS Gen 2?


I would like to register a dataset from ADLS Gen2 in my Azure Machine Learning workspace (azureml-core==1.12.0). Since the Python SDK documentation does not list service principal information as required for .register_azure_data_lake_gen2(), I used the following code, which successfully registered ADLS Gen2 as a datastore:

import os

from azureml.core import Datastore

adlsgen2_datastore_name = os.environ['adlsgen2_datastore_name']
account_name = os.environ['account_name']  # ADLS Gen2 account name
file_system = os.environ['filesystem']     # container / file system name

adlsgen2_datastore = Datastore.register_azure_data_lake_gen2(
    workspace=ws,
    datastore_name=adlsgen2_datastore_name,
    account_name=account_name, 
    filesystem=file_system
)

However, when I try to register a dataset using

from azureml.core import Dataset
adls_ds = Datastore.get(ws, datastore_name=adlsgen2_datastore_name)
data = Dataset.Tabular.from_delimited_files((adls_ds, 'folder/data.csv'))

I get an error:

Cannot load any data from the specified path. Make sure the path is accessible and contains data. ScriptExecutionException was caused by StreamAccessException. StreamAccessException was caused by AuthenticationException. 'AdlsGen2-ReadHeaders' for '[REDACTED]' on storage failed with status code 'Forbidden' (This request is not authorized to perform this operation using this permission.), client request ID <CLIENT_REQUEST_ID>, request ID <REQUEST_ID>. Error message: [REDACTED] | session_id=<SESSION_ID>

Do I need to enable the service principal to get this to work? Using the ML Studio UI, it appears that the service principal is required even to register the datastore.

Another issue I noticed is that AMLS is trying to access the dataset at https://adls_gen2_account_name.dfs.core.windows.net/container/folder/data.csv, whereas the actual URI in ADLS Gen2 is https://adls_gen2_account_name.blob.core.windows.net/container/folder/data.csv (dfs vs. blob endpoint).
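
For reference, both URIs point at the same data: a Gen2 storage account with hierarchical namespace enabled exposes a blob endpoint and a dfs (ABFS) endpoint, and Gen2-aware tooling such as the AMLS datastore targets the latter. A sketch of the two endpoint forms (the account name is illustrative):

account_name = "adls_gen2_account_name"  # illustrative
blob_url = f"https://{account_name}.blob.core.windows.net"  # blob endpoint (shown in the portal)
dfs_url = f"https://{account_name}.dfs.core.windows.net"    # Gen2 (ABFS) endpoint, used by AMLS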


Solution

  • According to the documentation, you need to enable the service principal.

    1. Register your application and grant its service principal Storage Blob Data Reader access on the storage account (you can sanity-check the grant with the sketch at the end of this answer).


    2. Try this code:

    # tenant_id, client_id and client_secret come from the app registration
    # created in step 1.
    adlsgen2_datastore = Datastore.register_azure_data_lake_gen2(
        workspace=ws,
        datastore_name=adlsgen2_datastore_name,
        account_name=account_name,
        filesystem=file_system,
        tenant_id=tenant_id,
        client_id=client_id,
        client_secret=client_secret
    )
    
    adls_ds = Datastore.get(ws, datastore_name=adlsgen2_datastore_name)
    dataset = Dataset.Tabular.from_delimited_files((adls_ds,'sample.csv'))
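    # Not in the original answer: once loading works, the dataset can also be
    # registered in the workspace (the question's goal); the name is illustrative.
    dataset = dataset.register(workspace=ws, name='adlsgen2_sample_dataset')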
    print(dataset.to_pandas_dataframe())
    

    Result: the CSV loads and the DataFrame prints as expected.
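
    Optionally, you can sanity-check the role assignment outside of AMLS with the azure-identity and azure-storage-file-datalake packages. This is a minimal sketch, reusing the variables from the snippets above; the folder path is illustrative:

    from azure.identity import ClientSecretCredential
    from azure.storage.filedatalake import DataLakeServiceClient

    # Same service principal credentials used when registering the datastore.
    credential = ClientSecretCredential(tenant_id, client_id, client_secret)

    # ADLS Gen2 data is served from the .dfs endpoint, not .blob.
    service = DataLakeServiceClient(
        account_url=f"https://{account_name}.dfs.core.windows.net",
        credential=credential,
    )

    # If the Storage Blob Data Reader grant is in place, listing paths succeeds.
    fs = service.get_file_system_client(file_system)
    print([p.name for p in fs.get_paths(path='folder')])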