Search code examples
azureazure-data-lakeu-sql

U-Sql view to merge duplicates via ranking


I have data lying in multiple files with naming convention as {year}/{month}/{date} which have duplicates (every day delta where records may get updated everyday).

I want to create a view that will return the records with the duplicates merged / squashed. The duplicates will be ranked and only the latest updated records corresponding to each primary key will be returned.

But the use of rowsets in view seems to be not supported. Basically something like this:

CREATE VIEW viewname AS 

@sourcedata  = EXTRACT  //schema
from //filenamePattern (regex)
using Extractors.TSV()

@sourceData =  SELECT *,
       ROW_NUMBER() OVER(PARTITION BY primary_Key ORDER BY timestamp DESC) AS RowNumber FROM @SourceData;

SELECT //schema
from @sourceData WHERE RowNumber == 1

So that when I do

select * from viewname

I get the merged data directly from the underlying files. How to achieve this ?


Solution

  • I think it can be solved by table valued function. Have you tried using it?

    https://msdn.microsoft.com/en-us/azure/data-lake-analytics/u-sql/u-sql-functions