I'm experimenting with Azure Data Lake, and trying to consume a bunch of data files.
The files are CSVs. The folder structure looks like:
/jobhistory/(AccountId)/(JobId)/*.csv
In the CSV files, the 6th column is the username
.
What I'd like to do is extract out the account id, the job id, and the username (then, in the interests of experimentation, do some aggregates on this data).
Following an online tutorial, I wrote something like this:
DECLARE @file_set_path = "/jobhistory/{AccountId}/{JobId}/{FileName}.csv";
@metadata =
EXTRACT AccountId int,
JobId string,
FileName string,
UserName string
FROM @file_set_path
USING Extractors.Csv();
Now, the problem (I think) I have is that the UserName
field is the 6th column in the csv files, but there are no header rows in the files.
How would I assign UserName
to the 6th column in the files?
Also, please let me know if I'm totally going down a wrong path here; this is very different from what I'm used to.
The built-in CSV
extractor is a positional extractor. That means you have to specify all columns (even those you are not interested in) in the extract schema.
So you would write something like (assuming username is the 6th col and you have 10 cols):
DECLARE @file_set_path = "/jobhistory/{AccountId}/{JobId}/{FileName}.csv";
@metadata =
EXTRACT AccountId int,
JobId string,
FileName string,
c1 string, c2 string, c3 string, c4 string, c5 string,
UserName string,
c7 string, c8 string, c9 string
FROM @file_set_path
USING Extractors.Csv();
@metadata =
SELECT AccountId,
JobId,
FileName,
UserName
FROM @metadata;
Note that the SELECT
projection will be pushed into the EXTRACT
so it will not do the full column processing for the columns that you do not select.
If you know that the 6th column is the column you are interested in, you could also write a custom extractor to skip the other columns. However the trade-off of running a custom extractor compared to a built-in one may not be worth it.
(Also note that you can use the ADL Tooling to create the EXTRACT
expression (without the virtual columns), so you do not need to do it by hand: