I'm having an issue with scheduling a job in Azure Data Factory. I'm trying to set up an hourly scheduled job that executes the same script each hour with a different condition.
Consider that I have a bunch of Avro files spread across Azure Data Lake Store with the following pattern:

/Data/SomeEntity/{date:yyyy}/{date:MM}/{date:dd}/SomeEntity_{date:yyyy}{date:MM}{date:dd}__{date:H}
Each hour new files are added to Data Lake Store. In order to process the files only once, I decided to handle them with the help of a U-SQL virtual file set column and a SyncTable that I created in Data Lake Store.
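For reference, such a sync table could be created roughly as follows. This is only a sketch: the question never shows the table's DDL, so the schema below is inferred from the two columns the query uses (EntityName, ProcessedDate), and the index/distribution choice is an assumption.

CREATE TABLE IF NOT EXISTS dbo.SyncTable
(
    EntityName string,        // e.g. "SomeEntity"
    ProcessedDate DateTime,   // high-water mark of the last processed hour
    INDEX idx_SyncTable CLUSTERED (EntityName ASC) DISTRIBUTED BY HASH (EntityName)
);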
My query looks like the following:
DECLARE @file_set_path string = "/Data/SomeEntity/{date:yyyy}/{date:MM}/{date:dd}/SomeEntity_{date:yyyy}_{date:MM}_{date:dd}__{date:H}";
@result =
    EXTRACT [Id] long,
            ....
            date DateTime   // virtual file set column populated from the path pattern
    FROM @file_set_path
    USING someextractor;
@rdate =
    SELECT MAX(ProcessedDate) AS ProcessedDate
    FROM dbo.SyncTable
    WHERE EntityName == "SomeEntity";
@finalResult =
    SELECT [Id], ...
    FROM @result
    CROSS JOIN @rdate AS r
    WHERE date >= r.ProcessedDate;
Since I can't use a rowset variable in the WHERE clause, I cross join the single-row rowset with the extracted set; however, even in this case U-SQL won't find the correct files and always returns the entire file set.

Is there any workaround or other approach?
Please note that file sets cannot perform partition elimination on dynamic joins, since the values are not known to the optimizer during the preparation phase.
I would suggest passing the sync point as a parameter from ADF to the processing script. Then the value is known to the optimizer and file set partition elimination will kick in. In the worst case, you would have to read the value from your sync table in a previous script and use it as a parameter in the next.
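For illustration, here is a minimal sketch of that two-script setup. The sync-point output path, the parameter name @processedDate, and the default value are all made up for the example, and someextractor stands in for whatever extractor the question actually uses:

// ---- Script 1 (the "worst case" helper): read the sync point and emit it
// ---- so the pipeline can pick it up and pass it to the next activity.
@syncPoint =
    SELECT MAX(ProcessedDate) AS ProcessedDate
    FROM dbo.SyncTable
    WHERE EntityName == "SomeEntity";

OUTPUT @syncPoint
TO "/Data/SyncPoint/SomeEntity.csv"   // hypothetical location
USING Outputters.Csv();

// ---- Script 2 (the processing script): DECLARE EXTERNAL gives the variable
// ---- a default but lets ADF override it at submission time. Because
// ---- @processedDate is then a constant known during preparation, the
// ---- predicate below can drive file set partition elimination, so only
// ---- the matching files are read.
DECLARE EXTERNAL @processedDate DateTime = DateTime.Parse("2017-01-01T00:00:00");

DECLARE @file_set_path string = "/Data/SomeEntity/{date:yyyy}/{date:MM}/{date:dd}/SomeEntity_{date:yyyy}_{date:MM}_{date:dd}__{date:H}";

@result =
    EXTRACT [Id] long,
            date DateTime
    FROM @file_set_path
    USING someextractor;

@finalResult =
    SELECT [Id]
    FROM @result
    WHERE date >= @processedDate;   // constant comparison instead of a join

On the ADF side, the U-SQL activity's parameters property can then supply processedDate; a value passed that way takes precedence over the DECLARE EXTERNAL default, which is exactly what makes it visible to the optimizer during the preparation phase.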