Search code examples
c#logiccortana-intelligenceazure-data-lakeu-sql

Azure Data Lake Analytics: Combine overlapping time duration using U-SQL


I want to remove overlapping time duration from CSV data placed in Azure Data Lake Store using U-SQL and combine those rows. Data set contains start time and end time with several other attributes for each record. Here is an example:

Start Time - End Time - Usar Name
5:00 AM - 6:00 AM - ABC
5:00 AM - 6:00 AM - XYZ
8:00 AM - 9:00 AM - ABC
8:00 AM - 10:00 AM - ABC
10:00 AM - 2:00 PM - ABC
7:00 AM - 11:00 AM - ABC
9:00 AM - 11:00 AM - ABC
11:00 AM - 11:30 AM - ABC

After removing overlap, output data set will look like:

Start Time - End Time - Usar Name
5:00 AM - 6:00 AM - ABC
5:00 AM - 6:00 AM - XYZ
7:00 AM - 2:00 PM - ABC

Note that CSV contains enormous amount of data and consists of several GB of size. I am trying to solve this problem but no luck. I want to avoid U-SQL User defined operators for Azure Data Lake Analytics jobs and looking for some efficient solution from within U-SQL.


Solution

  • It looks like you want to aggregate all the data for the rows that provide overlapping timeframes? Or what do you want to do with the data in the other columns?

    At first glance, I would suggest that you use a user-defined REDUCER or a user-defined aggregator, depending on what you want to achieve with the other data.

    However, a problem I see is that you may need a fix point recursion to create the common overlapping ranges. Unfortunately, there is no fix point recursion in U-SQL (nor Hive) because scale out processing of recursion can't be done efficiently.

    UPDATE AFTER CLARIFICATION:

    That is easier I think. You just take the min of the beginning and the max of the end and group by the key value:

    @r = EXTRACT begin DateTime, end DateTime,
                 data string
         FROM "/temp/ranges.txt"
         USING Extractors.Text(delimiter:'-');
    
    @r = SELECT MIN(begin) AS begin,
                MAX(end) AS end,
                data
         FROM @r
         GROUP BY data;
    
    OUTPUT @r
    TO "/temp/result.csv"
    USING Outputters.Csv();
    

    Note this works only if your ranges are on the same day and do not span over midnight.

    UPDATED WITH A SOLUTION THAT HANDLES DISJOINT RANGES FOR A USER You can solve it with a user-defined reducer. The following blog post explains the details of the solution and provides links to the GitHub code: https://blogs.msdn.microsoft.com/mrys/2016/06/08/how-do-i-combine-overlapping-ranges-using-u-sql-introducing-u-sql-reducer-udos/