Search code examples
hdfsbigdataazure-data-lakeu-sql

U-SQL extract a column based on its ordinal position


I'm experimenting with Azure Data Lake, and trying to consume a bunch of data files.

The files are CSVs. The folder structure looks like:

/jobhistory/(AccountId)/(JobId)/*.csv

In the CSV files, the 6th column is the username.

What I'd like to do is extract out the account id, the job id, and the username (then, in the interests of experimentation, do some aggregates on this data).

Following an online tutorial, I wrote something like this:

DECLARE @file_set_path = "/jobhistory/{AccountId}/{JobId}/{FileName}.csv";

@metadata =
    EXTRACT AccountId int,
            JobId string,
            FileName string,
            UserName string
    FROM @file_set_path
    USING Extractors.Csv();

Now, the problem (I think) I have is that the UserName field is the 6th column in the csv files, but there are no header rows in the files.

How would I assign UserName to the 6th column in the files?

Also, please let me know if I'm totally going down a wrong path here; this is very different from what I'm used to.


Solution

  • The built-in CSV extractor is a positional extractor. That means you have to specify all columns (even those you are not interested in) in the extract schema.

    So you would write something like (assuming username is the 6th col and you have 10 cols):

    DECLARE @file_set_path = "/jobhistory/{AccountId}/{JobId}/{FileName}.csv";
    
    @metadata =
      EXTRACT AccountId int,
              JobId string,
              FileName string,
              c1 string, c2 string, c3 string, c4 string, c5 string,
              UserName string,
              c7 string, c8 string, c9 string
      FROM @file_set_path
      USING Extractors.Csv();
    
    @metadata = 
      SELECT AccountId,
              JobId,
              FileName,
              UserName
      FROM @metadata;
    

    Note that the SELECT projection will be pushed into the EXTRACT so it will not do the full column processing for the columns that you do not select.

    If you know that the 6th column is the column you are interested in, you could also write a custom extractor to skip the other columns. However the trade-off of running a custom extractor compared to a built-in one may not be worth it.

    (Also note that you can use the ADL Tooling to create the EXTRACT expression (without the virtual columns), so you do not need to do it by hand:

    https://github.com/Azure/AzureDataLake/blob/master/docs/Release_Notes/2017/2017_Summer/USQL_Release_Notes_2017_Summer.md#adl-tools-for-visualstudio-now-helps-you-generate-the-u-sql-extract-statement