azure, command-line-interface, parquet, azure-machine-learning-service

Azure ML CLI v2 create data asset with MLTable


I uploaded parquet files to blob storage and created a data asset via the Azure ML GUI. The steps there are precise and clear, and the outcome is as desired. In future I would like to use the CLI to create the data asset and new versions of it.

The base command would be az ml data create -f <file-name>.yml. The docs provide a minimal example of an MLTable file, which should reside next to the parquet files.

# directory in blobstorage
├── data
│   ├── MLTable
│   ├── file_1.parquet
.
.
.
│   ├── file_n.parquet

I am still not sure how to properly specify those files in order to create a tabular dataset with column conversion.

Do I need to specify the full path or the pattern in the yml file?

$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json

type: mltable
name: Test data
description: Basic example for parquet files
path: azureml://datastores/workspaceblobstore/paths/*/*.parquet # pattern or path to dir?

I am even less sure about the MLTable file:

type: mltable

paths:
  - pattern: ./*.parquet
transformations:
  - read_parquet:
      # what comes here?

E.g. I have a column with dates in the format %Y-%m-%d %H:%M:%S, which should be converted to a timestamp. (In the GUI, at least, I can provide this information!)

Any help on this topic or hidden links to documentation would be great.


Solution

  • A working MLTable file that converts string columns from parquet files to datetimes looks like this:

    ---
    type: mltable
    paths:
      - pattern: ./*.parquet
    transformations:
      - read_parquet:
          include_path_column: false
      - convert_column_types:
          - columns: column_a
            column_type:
              datetime:
                formats:
                  - "%Y-%m-%d %H:%M:%S"
      - convert_column_types:
          - columns: column_b
            column_type:
              datetime:
                formats:
                  - "%Y-%m-%d %H:%M:%S"
    

    (Note: at the time of writing, specifying multiple columns as an array, e.g. columns: [column_a, column_b], did not work, hence one convert_column_types entry per column.)
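
  • The question about path in the data asset file is not covered above, so here is a minimal sketch. It assumes, as I understand the CLI v2 docs, that for type: mltable the path points at the folder containing the MLTable file rather than at a file pattern. The asset name and the data/ folder are hypothetical placeholders taken from the directory layout in the question:

    $schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
    type: mltable
    name: test-data  # hypothetical asset name
    description: Basic example for parquet files
    # Assumption: the folder that holds the MLTable file, not a *.parquet glob.
    # "data/" is a placeholder for the actual folder in the datastore.
    path: azureml://datastores/workspaceblobstore/paths/data/

    With this layout the file pattern (./*.parquet) lives only in the MLTable file, and the asset would be registered with az ml data create -f <file-name>.yml. Treat this as a starting point to verify against your workspace, not a confirmed recipe.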