azure command-line-interface parquet azure-machine-learning-service

Azure ML CLI v2 create data asset with MLTable

I uploaded parquet files to blob storage and created a data asset via the Azure ML GUI. The steps are precise and clear, and the outcome is as desired. In the future, I would like to use the CLI to create the data asset and new versions of it.

The base command would be az ml data create -f <file-name>.yml. The docs provide a minimal example of an MLTable file, which should reside next to the parquet files.

# directory in blob storage
├── data
│   ├── MLTable
│   ├── file_1.parquet
│   ├── ...
│   └── file_n.parquet

I am still not sure how to properly specify those files in order to create a tabular dataset with column conversion.

Do I need to specify the full path or the pattern in the yml file?

$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json

type: mltable
name: Test data
description: Basic example for parquet files
path: azureml://datastores/workspaceblobstore/paths/*/*.parquet # pattern or path to dir?

I am even more confused about the MLTable file:

type: mltable

paths:
  - pattern: ./*.parquet
transformations:
  - read_parquet:
      # what comes here?

E.g. I have a column with dates in the format %Y-%m-%d %H:%M:%S which should be converted to a timestamp. (I can provide this information in the GUI, at least!)

Any help on this topic or hidden links to documentation would be great.

Solution

A working MLTable file to convert string columns from parquet files looks like this:

---
type: mltable
paths:
  - pattern: ./*.parquet
transformations:
  - read_parquet:
      include_path_column: false
  - convert_column_types:
      - columns: column_a
        column_type:
          datetime:
            formats:
              - "%Y-%m-%d %H:%M:%S"
  - convert_column_types:
      - columns: column_b
        column_type:
          datetime:
            formats:
              - "%Y-%m-%d %H:%M:%S"
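
Before registering the asset, the format string from the MLTable file above can be sanity-checked locally. This is only a minimal sketch using Python's standard library. It assumes that MLTable interprets these directives the same way as strptime, which I have not verified beyond the common ones used here. The helper name matches_format and the sample values are made up for illustration.

```python
from datetime import datetime

# Format string copied from the MLTable file above.
FORMAT = "%Y-%m-%d %H:%M:%S"


def matches_format(value: str, fmt: str = FORMAT) -> bool:
    """Return True if `value` parses completely with `fmt` (hypothetical helper)."""
    try:
        datetime.strptime(value, fmt)
    except ValueError:
        return False
    return True


# Made-up sample values, not taken from the real parquet files.
for sample in ["2023-01-15 08:30:00", "15.01.2023 08:30"]:
    print(sample, "->", matches_format(sample))
```

Running it against a handful of real values from column_a and column_b should show quickly whether the format needs adjusting.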

(By the way, at the time of writing, specifying multiple columns as an array did not work, e.g. columns: [column_a, column_b].)
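
Since each column needed its own convert_column_types step, the file gets repetitive with many datetime columns. One possible workaround is to generate the MLTable text with a small script. The following is a sketch under that assumption, using only the standard library. The function name build_mltable and its parameters are hypothetical and not part of any Azure tooling.

```python
def build_mltable(datetime_columns, fmt="%Y-%m-%d %H:%M:%S", pattern="./*.parquet"):
    """Return MLTable YAML text with one convert_column_types step per column."""
    lines = [
        "type: mltable",
        "paths:",
        f"  - pattern: {pattern}",
        "transformations:",
        "  - read_parquet:",
        "      include_path_column: false",
    ]
    # One separate step per column, mirroring the working file above.
    for column in datetime_columns:
        lines += [
            "  - convert_column_types:",
            f"      - columns: {column}",
            "        column_type:",
            "          datetime:",
            "            formats:",
            f'              - "{fmt}"',
        ]
    return "\n".join(lines) + "\n"


# Print the generated text for the two columns used in the example above.
print(build_mltable(["column_a", "column_b"]))
```

Writing the returned text to a file named MLTable next to the parquet files should give the same structure as the hand-written file above.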