I uploaded Parquet files to blob storage and created a data asset via the Azure ML GUI. The steps there are clear and the outcome is as desired. For future use, I would like to create the data asset, and new versions of it, with the CLI.
The base command would be az ml data create -f <file-name>.yml. The docs provide a minimal example of an MLTable file, which should reside next to the Parquet files.
# directory in blobstorage
├── data
│ ├── MLTable
│ ├── file_1.parquet
.
.
.
│ ├── file_n.parquet
I am still not sure how to specify those files properly in order to create a tabular dataset with column conversion.
Do I need to specify the full path or a pattern in the yml file?
$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
type: mltable
name: Test data
description: Basic example for parquet files
path: azureml://datastores/workspaceblobstore/paths/*/*.parquet # pattern or path to dir?
I am also confused about the MLTable file:
type: mltable
paths:
  - pattern: ./*.parquet
transformations:
  - read_parquet:
      # what comes here?
E.g. I have a column of dates in the format %Y-%m-%d %H:%M:%S that should be converted to a timestamp. (I can at least provide this information in the GUI!)
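Before putting a format string into the MLTable file, it can be sanity-checked locally. The snippet below is only a sketch: it assumes the MLTable datetime formats use the same strftime-style directives as Python's datetime module, and the helper name matches_format and the sample values are made up for illustration.

```python
from datetime import datetime


def matches_format(value: str, fmt: str) -> bool:
    """Return True if `value` parses with the strftime-style format `fmt`."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


if __name__ == "__main__":
    fmt = "%Y-%m-%d %H:%M:%S"
    # Hypothetical sample values, not taken from the real data.
    for sample in ("2023-01-31 13:45:00", "31.01.2023 13:45"):
        print(sample, "->", matches_format(sample, fmt))
```

Running it against a few values copied from the actual column shows quickly whether the format string has a typo.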
Any help on this topic or hidden links to documentation would be great.
A working MLTable file to convert string columns from parquet files looks like this:
---
type: mltable
paths:
  - pattern: ./*.parquet
transformations:
  - read_parquet:
      include_path_column: false
  - convert_column_types:
      - columns: column_a
        column_type:
          datetime:
            formats:
              - "%Y-%m-%d %H:%M:%S"
  - convert_column_types:
      - columns: column_b
        column_type:
          datetime:
            formats:
              - "%Y-%m-%d %H:%M:%S"
(By the way, at the time of writing, specifying multiple columns as an array, e.g. columns: [column_a, column_b], did not work.)
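Since each column needs its own convert_column_types entry, writing the file by hand gets repetitive with many date columns. The following is a minimal sketch of generating it instead. The function name render_mltable and its parameters are hypothetical. It builds the text with plain string formatting, so it mirrors the layout of the file above and is not a general YAML writer.

```python
def render_mltable(datetime_columns, fmt="%Y-%m-%d %H:%M:%S", pattern="./*.parquet"):
    """Build MLTable file content with one convert_column_types step per column."""
    lines = [
        "type: mltable",
        "paths:",
        f"  - pattern: {pattern}",
        "transformations:",
        "  - read_parquet:",
        "      include_path_column: false",
    ]
    # One transformation per column, because the array form
    # (columns: [column_a, column_b]) did not work at the time of writing.
    for column in datetime_columns:
        lines += [
            "  - convert_column_types:",
            f"      - columns: {column}",
            "        column_type:",
            "          datetime:",
            "            formats:",
            f'              - "{fmt}"',
        ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print the content; redirect it to a file named MLTable next to the parquet files.
    print(render_mltable(["column_a", "column_b"]), end="")
```

Column names containing characters that are special in YAML would need quoting, which this sketch does not handle.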