
Load data from public Azure blob in Matillion


I am going through Matillion Academy (Building a Data Warehouse). There is a slide deck to follow online, and I am running my own instance of Matillion to recreate the building of the warehouse. My Matillion is on Azure, as is my Snowflake database. The training is AWS-based, but gives information about the adjustments needed for Azure or GCP.

One of the steps shows how to load data from blob storage. It is S3-based. For Azure, different components need to be used (as the S3 ones don't exist there), and the data needs to be loaded from Azure storage instead of S3. It also explains that for Snowflake on Azure yet another component needs to be used.

I have created a Stage in Snowflake:

    CREATE STAGE "onlinemtlntrainingazure_flights"
    URL='azure://onlinemtlntrainingazure.blob.core.windows.net/online-mtln-training-azure-flights';

The stage shows up in Snowflake (as an external stage) and in Matillion (when using 'Manage Stages' on the database). The SQL above is taken from the .json file I imported to create the job that does this (see the first step below).
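
As a sanity check (my own addition, not part of the training), the stage can also be queried directly in a Snowflake worksheet. LIST may fail if the public container only allows anonymous blob reads rather than listing, but SHOW STAGES at least confirms the name resolves:

    -- Confirm the stage exists (SHOW ... LIKE is case-insensitive)
    SHOW STAGES LIKE 'onlinemtlntrainingazure_flights';
    -- Listing may fail if the public container does not allow anonymous listing
    LIST @"onlinemtlntrainingazure_flights";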

I have created the target table in my database. It is accessible and visible in the Matillion IDE.

The adjusted component I am to use is 'Azure Blob Storage Load'. According to the documentation, I will need:

For Snowflake on Azure: Create a Stage in Snowflake:

You should create a Stage in Snowflake which points to the public data we provide. Please find below the .json file containing the job that will help you do this. Don't forget to change the SQL script to point to your own schema.

After Creating the Stage in Snowflake:

You should use the 'Create Table' and 'Azure Blob Storage Load' components individually, as the 'Azure Blob Load Generator' won't let you select the Stage previously created. We have attached below the Create Table metadata to save you some time.

'Azure Blob Storage Load' Settings:

  • Stage: onlinemtlntrainingazure_flights
  • Pattern: training_azure_flights_2016.gz
  • Target Table: training_flights
  • Record Delimiter: 0x0a
  • Skip Header: 1
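
My understanding is that these settings boil down to roughly the following COPY INTO statement (a sketch of the SQL the component should generate, not the exact text Matillion produces; the table and stage names are the ones above):

    COPY INTO training_flights
    FROM @"onlinemtlntrainingazure_flights"
    PATTERN = '.*training_azure_flights_2016.gz'
    FILE_FORMAT = (TYPE = CSV RECORD_DELIMITER = '0x0a' SKIP_HEADER = 1);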

The source data on Azure is located here:

Azure Blob Container (with flights data)

https://onlinemtlntrainingazure.blob.core.windows.net/online-mtln-training-azure-flights/training_azure_flights_2016.gz

https://onlinemtlntrainingazure.blob.core.windows.net/online-mtln-training-azure-flights/training_azure_flights_2017.gz

https://onlinemtlntrainingazure.blob.core.windows.net/online-mtln-training-azure-flights/training_azure_flights_2018.gz

Unfortunately, when I use these settings on the 'Azure Blob Storage Load' component, it complains:

  • The Stage does not appear in the drop-down list, and manually entering the stage name yields an error ('unrecognised option'). Prefixing the stage name with my schema (and even the database) does not help (the fully qualified form I mean is sketched just after this list).
  • The Azure Storage Location property does not accept the https://... URI to the data files. When I replace 'https' with 'azure', or remove the part after the last '/', it complains with 'Unable to find an account with name: [onlinemtlntrainingazure]'.
  • Using [Custom] for the Stage property removes the error message, but when I run the component the same 'Unable to find an account' error comes back.
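
For clarity, the fully qualified form mentioned above is the one sketched here, which can also be checked directly in a Snowflake worksheet (MY_DB and MY_SCHEMA are placeholders for my own database and schema):

    -- MY_DB and MY_SCHEMA are placeholders for my own database and schema
    DESC STAGE MY_DB.MY_SCHEMA."onlinemtlntrainingazure_flights";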

Any thoughts?

Edit: I found a workaround using the Data Transfer component, which first copies the files from the public https location to my own Azure blob location; I then process them further from there. But I would like to know how to do it as suggested in the training, and why it currently fails.


Solution

  • The example files are in a storage account that your Azure Blob Storage Load Generator cannot read from. But instead of using a Snowflake Stage, you might find it easier to copy the files into a storage account that you do own, and then use the Azure Blob Storage Load Generator on the copied files.

    In a Matillion ETL instance on Azure, you can access files over https and copy them into your own storage account using a Data Transfer component.

    You already have the https:// source URLs for the three files, so:

    • Set the source type to HTTPS (no username or password is needed)
    • Add the source URL
    • Set the target type to Azure Blob Storage
    • In the example I used two variables, with defaults set to my storage account and container name
    • Repeat for all three files

    After running the Data Transfer three times, you will then be able to proceed with the Azure Blob Storage Load Generator, reading from your own copies of the files.
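
    Purely as a sketch of an alternative (not part of the training): once the files are in your own account, you could also point a Snowflake Stage at your own container and load with plain SQL or the 'Azure Blob Storage Load' component. Here 'mystorageaccount', 'flights', and the SAS token are placeholders you would replace with your own values:

        -- Sketch only: 'mystorageaccount', 'flights', and the SAS token are placeholders
        CREATE STAGE my_flights_stage
          URL = 'azure://mystorageaccount.blob.core.windows.net/flights'
          CREDENTIALS = (AZURE_SAS_TOKEN = '?sv=...');

        -- Load all three copied files in one go
        COPY INTO training_flights
        FROM @my_flights_stage
        PATTERN = '.*training_azure_flights_201[678].gz'
        FILE_FORMAT = (TYPE = CSV RECORD_DELIMITER = '0x0a' SKIP_HEADER = 1);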