Not a good title to the post, but hopefully it'll catch some eyes.
I have a very complex situation in T-SQL that I am unable to accomplish. I'm hoping someone with expertise knows an elegant and fast solution so that my performance is not impacted. I'm dealing with billions of rows.
PREFACE
I have a table called Customers with a unique ID. Those customers have Files, Files have Properties, and each Property Name corresponds to a single Value.
Tables:
The Customer ID is present in all of these tables, as are audit fields such as UpdatedDtm and CreationDtm.
USE CASE
I need to join all customers to their files (filtering for a few) and then tie every file to their properties (again filtering these). This is easy but results in lots of rows, one for each customer x file x property.
I know that the property names will never changes, and I want to return just a select few, so I used a pivot and resulted in a nice table, but it fell apart after I started doing more complex queries.
THE PROBLEM
First, the properties have a DateTime for when they were altered (UpdatedDtm), and I need to return everything altered from 1 hour of the creation date (CreationDtm) in the File table.
This results in me trimming down my list of potential properties, but now I have a table with an RowNumber() per ID and no good way to pivot and select the first one that isn't null and still preserve the number of columns for the table defnition. This is important because I'm using Dynamic SQL and placing it in an indexed temp table with a Composite Key on CustomerID and FileName.
BEFORE PIVOT
| UpdatedDtm | CustomerID | FileName | Property | Value |
| ---------- | ---------- | ---------- | -------- | -------------- |
| 1/1/2015 | 1 | FileOne | Size | NULL |
| 1/1/2015 | 1 | FileOne | Format | JPG |
| 1/7/2015 | 1 | FileOne | Size | 88KB |
| 1/7/2015 | 1 | FileOne | Format | JPG |
| 1/7/2015 | 1 | FileOne | Comment | NULL |
| 1/11/2015 | 1 | FileOne | Comment | NULL |
| 1/1/2015 | 1 | FileTwo | Size | 91KB |
| 1/1/2015 | 1 | FileTwo | Format | PNG |
| 1/11/2015 | 1 | FileTwo | Comment | NULL |
| 1/2/2015 | 2 | FileThree | Size | 74KB |
| 1/2/2015 | 2 | FileThree | Format | XLS |
| 1/2/2015 | 2 | FileThree | State | Open |
| 1/7/2015 | 2 | FileThree | State | Closed |
| 1/10/2015 | 2 | FileThree | Comment | NULL |
| 1/1/2015 | 3 | FileFour | Size | 2KB |
| 1/2/2015 | 3 | FileFour | Size | 10KB |
| 1/3/2015 | 3 | FileFour | Size | 13KB |
| 1/4/2015 | 3 | FileFour | Size | 21KB |
| 1/5/2015 | 3 | FileFour | Size | 27KB |
| 1/6/2015 | 3 | FileFour | Size | 32KB |
| 1/7/2015 | 3 | FileFour | Size | 39KB |
| 1/8/2015 | 3 | FileFour | Size | 44KB |
| 1/1/2015 | 3 | FileFour | Format | TXT |
| 1/1/2015 | 3 | FileFour | Comment | NULL |
Please don't ask me why the database is setup this way or to change the schema. That is set in stone and out of my control. I need to be able to solve the use case as described.
AFTER PIVOT (Expectation)
| CustomerID | FileName | Size | Format | State | Comment |
| ---------- | ---------- | ---- | ------ | ------ | ------- |
| 1 | FileOne | 88KB | JPG | NULL | NULL |
| 1 | FileTwo | 91KB | PNG | NULL | NULL |
| 2 | FileThree | 74KB | XLS | Closed | NULL |
| 3 | FileFour | 44KB | TXT | NULL | NULL |
I have included some NULL values and missing values to showcase that I need to preserve the same columnar properties regardless of them having data, but I also need to squash the data by the the first non-null value within my date range.
CODE (My attempt)
IF Object_id('tempdb..#FilesQuery') IS NOT NULL DROP TABLE #FilesQuery;
CREATE TABLE #FilesQuery (
SeqNum int,
CustomerID numeric(16,0),
FileName varchar(64),
PropertyName varchar(64),
PropertyValue varchar(64)
)
INSERT INTO #FilesQuery
SELECT
CASE WHEN P.[Value] IS NOT NULL
THEN ROW_NUMBER() OVER (partition by C.CustomerID order by UpdatedDtm)
ELSE 0
END as SeqNum,
C.CustomerID
,F.Name as FileName
,P.Name as PropertyName
,P.Value as PropertyValue
FROM Customers C
INNER JOIN Files F ON F.CustomerID = C.CustomerID
LEFT JOIN Properties P
ON P.CustomerID = C.CustomerID
AND P.FileID = F.FileID
WHERE F.FileName IN ('FileOne','FileTwo','FileThree','FileFour')
AND P.Name IN ('Size','Format','State','Comment')
--PIVOT
DECLARE @cols AS nvarchar(MAX)
SELECT @cols = STUFF(
(SELECT DISTINCT ',' + QUOTENAME(PropertyName)
FROM #FilesQuery fq
FOR XML PATH(''),TYPE).value('.','NVARCHAR(MAX)'),1,1,'')
DECLARE @dynSql AS nvarchar(MAX)
SET @dynSql = '
SELECT DISTINCT *
FROM (
SELECT
fq.CustomerID,
fq.FileName,
fq.PropertyName,
fq.PropertyValue
FROM #FilesQuery fq
) SRC
PIVOT (
Max([PropertyValue])
FOR PropertyName IN (' + @cols + ')
) PVT
'
IF Object_id('tempdb..#Results') IS NOT NULL DROP TABLE #Results;
CREATE TABLE #Results (
CustomerID varchar(16) NOT NULL,
FileName varchar(64) NOT NULL,
FileSize varchar(64) NULL,
FileFormat varchar(64) NULL,
FileState varchar(64) NULL,
FileComment varchar(64) NULL,
CONSTRAINT pk_CustDoc PRIMARY KEY (CustomerID,FileName)
)
INSERT INTO #Results EXEC @dynSql;
I'm sorry this code isn't complete, it is the working section I have. The other tries I made resulted in bad data pulls.
I tried using SeqNum and a combination of case statements to try and select the first non-null value for each row so that the data was all on one line, but it ended up being more like.
FileOne NULL NULL Open NULL
FileOne NULL JPG NULL NULL
and so on...
I've been struggling on solving this special case for awhile and am about to scrap and it do something procedural with looping, but that would kill my query time and performance.
Anyone have a good solution? Am I over-thinking things?
you should filter your data before you PIVOT
and you will get your desired results. Here is a cte version to show you the steps of how to get what you want.
;WITH cteDefineRowPrecedence AS (
SELECT *
,ROW_NUMBER() OVER (PARTITION BY CustomerId, FileName, Property ORDER BY
CASE WHEN Value IS NOT NULL THEN 0 ELSE 1 END
,UpdatedDtm DESC) as RowNum
FROM
@Table
)
, cteDesiredRwows AS (
SELECT
CustomerId
,FileName
,Property
,Value
FROM
cteDefineRowPrecedence t
WHERE
t.RowNum = 1
AND t.Value IS NOT NULL
)
SELECT *
FROM
cteDesiredRwows t
PIVOT (
MAX(Value)
FOR Property IN (Size,[Format],[State],Comment)
) p
ORDER BY
CustomerId
,FileName
And here is a nested query version that will make it easier to embed/put in your dynamic sql....
SELECT *
FROM
(
SELECT CustomerId, FileName, Property, Value
FROM
(SELECT *
,ROW_NUMBER() OVER (PARTITION BY CustomerId, FileName, Property ORDER BY
CASE WHEN Value IS NOT NULL THEN 0 ELSE 1 END
,UpdatedDtm DESC) as RowNum
FROM
@Table) r
WHERE
r.RowNum = 1
AND r.Value IS NOT NULL
) t
PIVOT (
MAX(Value)
FOR Property IN (Size,[Format],[State],Comment)
) p
ORDER BY
CustomerId
,FileName