Search code examples
azureazure-data-lakeu-sql

Join files from different pathes in USQL


My data is saved on a daily basis in the following path: "/Data/{year}/{month}/{day}/mydata.json" So, e.g. "/Data/2018/10/1/mydata.json" , "/Data/2018/10/2/mydata.json", "/Data/2018/11/1/mydata.json", "/Data/2018/12/5/mydata.json", etc.

I would like to combine all the months and days in one file using USQL. Is it possible to do it in an easy way without mentioning each path separately (otherwise it's crazy to do it for all the days of the year)?

At the moment I use this:

    
REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];

USING Microsoft.Analytics.Samples.Formats.Json;


@a  =
   EXTRACT EventCategory string
          , EventAction string
		  , EventLabel string
    FROM  "/Data/2018/10/2/mydata.json"
    USING new JsonExtractor()

	UNION ALL 
	
 EXTRACT EventCategory string
          , EventAction string
		  , EventLabel string
    FROM  "/Data/2018/11/2/mydata.json"
   USING new JsonExtractor(); 

 OUTPUT @a
   TO "/Output/mydata.Csv"
  USING Outputters.Csv(outputHeader:true);  
    
    


Solution

  • I would like to combine all the months and days in one file using USQL. Is it possible to do it in an easy way without mentioning each path separately (otherwise it's crazy to do it for all the days of the year)?

    Yes! You can do this using patterns, a basic example:

    REFERENCE ASSEMBLY [Newtonsoft.Json];
    REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];
    
    DECLARE @input string = "/Data/2018/{*}/2/mydata.json";
    
    USING Microsoft.Analytics.Samples.Formats.Json;
    
    @a  =
       EXTRACT EventCategory string
              , EventAction string
              , EventLabel string
        FROM  @input
        USING new JsonExtractor()
    
     OUTPUT @a
       TO "/Output/mydata.Csv"
      USING Outputters.Csv(outputHeader:true);  
    

    this will load all data of the second day of the month.

    Other variations:

    DECLARE @input string = "/Data/2018/{*}/{*}/mydata.json"; will process all files of 2018

    DECLARE @input string = "/Data/{*}/12/{*}/mydata.json"; will process all files generated in the 12th month of all years

    If you want to retrieve the file parts to get the actual date parts you can do:

    REFERENCE ASSEMBLY [Newtonsoft.Json];
    REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];
    
    USING Microsoft.Analytics.Samples.Formats.Json;
    
    @a  =
       EXTRACT EventCategory string
              , EventAction string
              , EventLabel string
              , date DateTime
        FROM  "/Data/{date:yyyy}/{date:MM}/{date:dd}/mydata.json"
        USING new JsonExtractor()
    
     OUTPUT @a
       TO "/Output/mydata.Csv"
      USING Outputters.Csv(outputHeader:true);  
    

    As you can see there is now an additional column date of type DateTime that can be used in the query and/or included in the output.