EDIT: I have been looking into the required results. Let me explain better: What I try to reach is to create a row for each YearMonth for each SomeData, between it's start and end date.
So for example SomeData "88888888888888888881", with startDate "2005-12-06 00:00:00.000" and EndDate 2006-03-13 00:00:000". I want rows to go like:
88888888888888888881, 200512
88888888888888888881, 200601
88888888888888888881, 200602
88888888888888888881, 200603
I am aware this might "explode" the result into a huge file.
Below my post:
I am trying to re-write something in U-SQL that we did before in T-SQL.
The problem is that U-SQL does not allow for a between to happen during the join.
The T-SQL JOIN looks something like this:
SELECT rf.SomeData AS SomeData,
rd.YearMonth AS YearMonth,
(rf.SomeData + '-' + rd.YearMonth.ToString()) AS MonthlyKey,
rf.SomeKey AS SomeKey
FROM MyTable rf
INNER JOIN dbo.DimDate rd
ON rd.Date >= rf.StartDate
AND rd.Date <= (CASE WHEN rf.EndDate IS NULL THEN GETDATE() ELSE rf.EndDate END)
In U-SQL I started like this, but how should I write the JOIN now?:
@EditedTable =
SELECT rf.SomeData AS SomeData,
rd.YearMonth AS YearMonth,
(rf.SomeData + "-" + rd.YearMonth.ToString()) AS MonthlyKey,
rf.SomeKey AS SomeKey
FROM @MyTable AS rf
INNER JOIN
@date AS rd
ON
It is important that we get all the data between start and end date, and create a monthly key so that "SomeData" can be joined with another table later on.
I have tried using a cross join but when running it, it get stuck at 80% and never seems to end. It keeps writing GB's in one vertex. Besides, I am not actually sure this will deliver the same results.
@EditedTableCROSS =
SELECT rfj.SomeData AS SomeData,
rfj.StartDate AS StartDate,
rfj.EndDate AS EndDate,
(rfj.SomeData + "-" + dtj.YearMonth.ToString()) AS MonthlyKey,
rfj.SomeKey AS SomeKey
FROM
(
SELECT SomeData AS SomeData,
StartDate AS StartDate,
EndDate AS EndDate,
SomeKey AS SomeKey
FROM @TableA
WHERE SomeData != ""
) AS rfj
CROSS JOIN
(
SELECT DISTINCT
dt.Date AS Date,
dt.YearMonth AS YearMonth,
dt.Month AS Month,
rf.StartDate AS StartDate
FROM @date AS dt INNER JOIN @TableA AS rf ON rf.StartDate == dt.Date
WHERE rf.StartDate >= dt.Date AND
dt.Date <= DateTime.Now
) AS dtj
WHERE rfj.StartDate <= dtj.Date AND
rfj.EndDate >= dtj.Date;
The problem with the code above is that "INNER JOIN @TableA AS rf ON rf.StartDate == dt.Date" is not on an unique key, certain dates occur a lot more than once. So I doubt this is the way....
Please share your ideas?
EDIT: People asked for sample data, End date could contain:
2006-03-13 10:27:13.000
2016-03-02 18:48:11.000
2016-03-02 18:42:57.000
NULL
2013-09-12 09:19:05.000
NULL
2016-03-02 18:59:37.000
NULL
NULL
StartDate:
2005-12-06 00:00:00.000
2011-03-29 20:57:51.000
2007-11-01 00:00:00.000
2007-11-01 00:00:00.000
2007-11-01 00:00:00.000
2011-02-28 00:00:00.000
2011-02-28 00:00:00.000
2011-02-28 00:00:00.000
2008-01-17 00:00:00.000
DimDate contains dates from 2000 to 2018 at day level.
SomeDate and SomeKey would look something like:
88888888888888888881
88888888888888888882
88888888888888888883
88888888888888888884
88888888888888888885
88888888888888888886
88888888888888888887
88888888888888888888
88888888888888888889
I got this script to work with some sample data I generated.
@dateDim =
EXTRACT xdate DateTime,
yearMonth string
FROM "/input/dbo.DimDate.tsv"
USING Extractors.Tsv();
@data =
EXTRACT
someKey int,
someData string,
startDate DateTime,
endDate DateTime?
FROM "/input/dbo.MyTable.tsv"
USING Extractors.Tsv();
/*
// Use U-SQL ISNULL conditional operator which is ?
@working =
SELECT COUNT( * ) AS records
FROM
(
SELECT *
FROM @dateDim AS dd
CROSS JOIN
@data AS d
WHERE dd.xdate BETWEEN d.startDate AND (d.endDate == (DateTime?)null ? DateTime.Now : d.endDate)
) AS x;
*/
@working =
SELECT COUNT( * ) AS records
FROM
(
SELECT *
FROM @dateDim AS dd
CROSS JOIN
@data AS d
WHERE dd.xdate >= d.startDate
AND dd.xdate <= (d.endDate == (DateTime?)null ? DateTime.Now : d.endDate)
) AS x;