Search code examples
sqlsql-servert-sqlpivotnotnull

How to pivot sql data, and squash results into non-null rows by Date per ID


Not a good title to the post, but hopefully it'll catch some eyes.

I have a very complex situation in T-SQL that I am unable to accomplish. I'm hoping someone with expertise knows an elegant and fast solution so that my performance is not impacted. I'm dealing with billions of rows.


PREFACE

I have a table called Customers with a unique ID. Those customers have Files, Files have Properties, and each Property Name corresponds to a single Value.

Tables:

  • Customers
  • Files -
  • Property - contains both Name and Value

The Customer ID is present in all of these tables, as are audit fields such as UpdatedDtm and CreationDtm.


USE CASE

I need to join all customers to their files (filtering for a few) and then tie every file to their properties (again filtering these). This is easy but results in lots of rows, one for each customer x file x property.

I know that the property names will never changes, and I want to return just a select few, so I used a pivot and resulted in a nice table, but it fell apart after I started doing more complex queries.


THE PROBLEM First, the properties have a DateTime for when they were altered (UpdatedDtm), and I need to return everything altered from 1 hour of the creation date (CreationDtm) in the File table.

This results in me trimming down my list of potential properties, but now I have a table with an RowNumber() per ID and no good way to pivot and select the first one that isn't null and still preserve the number of columns for the table defnition. This is important because I'm using Dynamic SQL and placing it in an indexed temp table with a Composite Key on CustomerID and FileName.


BEFORE PIVOT

| UpdatedDtm | CustomerID | FileName   | Property | Value          |
| ---------- | ---------- | ---------- | -------- | -------------- |
| 1/1/2015   | 1          | FileOne    | Size     | NULL           |
| 1/1/2015   | 1          | FileOne    | Format   | JPG            |
| 1/7/2015   | 1          | FileOne    | Size     | 88KB           |
| 1/7/2015   | 1          | FileOne    | Format   | JPG            |
| 1/7/2015   | 1          | FileOne    | Comment  | NULL           |
| 1/11/2015  | 1          | FileOne    | Comment  | NULL           |
| 1/1/2015   | 1          | FileTwo    | Size     | 91KB           |
| 1/1/2015   | 1          | FileTwo    | Format   | PNG            |
| 1/11/2015  | 1          | FileTwo    | Comment  | NULL           |
| 1/2/2015   | 2          | FileThree  | Size     | 74KB           |
| 1/2/2015   | 2          | FileThree  | Format   | XLS            |
| 1/2/2015   | 2          | FileThree  | State    | Open           |
| 1/7/2015   | 2          | FileThree  | State    | Closed         |
| 1/10/2015  | 2          | FileThree  | Comment  | NULL           |
| 1/1/2015   | 3          | FileFour   | Size     | 2KB            |
| 1/2/2015   | 3          | FileFour   | Size     | 10KB           |
| 1/3/2015   | 3          | FileFour   | Size     | 13KB           |
| 1/4/2015   | 3          | FileFour   | Size     | 21KB           |
| 1/5/2015   | 3          | FileFour   | Size     | 27KB           |
| 1/6/2015   | 3          | FileFour   | Size     | 32KB           |
| 1/7/2015   | 3          | FileFour   | Size     | 39KB           |
| 1/8/2015   | 3          | FileFour   | Size     | 44KB           |
| 1/1/2015   | 3          | FileFour   | Format   | TXT            |
| 1/1/2015   | 3          | FileFour   | Comment  | NULL           |

Please don't ask me why the database is setup this way or to change the schema. That is set in stone and out of my control. I need to be able to solve the use case as described.


AFTER PIVOT (Expectation)

| CustomerID | FileName   | Size | Format | State  | Comment |
| ---------- | ---------- | ---- | ------ | ------ | ------- |
| 1          | FileOne    | 88KB | JPG    | NULL   | NULL    |
| 1          | FileTwo    | 91KB | PNG    | NULL   | NULL    |
| 2          | FileThree  | 74KB | XLS    | Closed | NULL    |
| 3          | FileFour   | 44KB | TXT    | NULL   | NULL    |

I have included some NULL values and missing values to showcase that I need to preserve the same columnar properties regardless of them having data, but I also need to squash the data by the the first non-null value within my date range.


CODE (My attempt)

IF Object_id('tempdb..#FilesQuery') IS NOT NULL DROP TABLE #FilesQuery;
CREATE TABLE #FilesQuery (
    SeqNum          int,
    CustomerID      numeric(16,0),
    FileName        varchar(64),
    PropertyName    varchar(64),
    PropertyValue   varchar(64)
)
INSERT INTO #FilesQuery
SELECT
     CASE WHEN P.[Value] IS NOT NULL
          THEN ROW_NUMBER() OVER (partition by C.CustomerID order by UpdatedDtm)
          ELSE 0
     END as SeqNum,
     C.CustomerID
    ,F.Name  as FileName
    ,P.Name  as PropertyName
    ,P.Value as PropertyValue

FROM Customers C
INNER JOIN Files F ON F.CustomerID = C.CustomerID
LEFT JOIN Properties P
    ON P.CustomerID = C.CustomerID
    AND P.FileID = F.FileID

WHERE F.FileName IN ('FileOne','FileTwo','FileThree','FileFour')
    AND P.Name IN ('Size','Format','State','Comment')

--PIVOT
DECLARE @cols AS nvarchar(MAX)
SELECT @cols = STUFF(
    (SELECT DISTINCT ',' + QUOTENAME(PropertyName)
       FROM #FilesQuery fq
        FOR XML PATH(''),TYPE).value('.','NVARCHAR(MAX)'),1,1,'')

DECLARE @dynSql AS nvarchar(MAX)
SET @dynSql = '
    SELECT DISTINCT *
    FROM (
        SELECT
            fq.CustomerID,
            fq.FileName,
            fq.PropertyName,
            fq.PropertyValue
        FROM #FilesQuery fq
    ) SRC
    PIVOT (
        Max([PropertyValue])
        FOR PropertyName IN (' + @cols + ')
    ) PVT
'

IF Object_id('tempdb..#Results') IS NOT NULL DROP TABLE #Results;
CREATE TABLE #Results (
    CustomerID      varchar(16) NOT NULL,
    FileName        varchar(64) NOT NULL,
    FileSize        varchar(64) NULL,
    FileFormat      varchar(64) NULL,
    FileState       varchar(64) NULL,
    FileComment     varchar(64) NULL,
    CONSTRAINT pk_CustDoc PRIMARY KEY (CustomerID,FileName)
)
INSERT INTO #Results EXEC @dynSql;

I'm sorry this code isn't complete, it is the working section I have. The other tries I made resulted in bad data pulls.

I tried using SeqNum and a combination of case statements to try and select the first non-null value for each row so that the data was all on one line, but it ended up being more like.

FileOne NULL NULL Open NULL
FileOne NULL JPG  NULL NULL

and so on...

I've been struggling on solving this special case for awhile and am about to scrap and it do something procedural with looping, but that would kill my query time and performance.

Anyone have a good solution? Am I over-thinking things?


Solution

  • you should filter your data before you PIVOT and you will get your desired results. Here is a cte version to show you the steps of how to get what you want.

    ;WITH cteDefineRowPrecedence AS (
        SELECT *
           ,ROW_NUMBER() OVER (PARTITION BY CustomerId, FileName, Property ORDER BY
              CASE WHEN Value IS NOT NULL THEN 0 ELSE 1 END
              ,UpdatedDtm DESC) as RowNum
        FROM
           @Table
    )
    
    , cteDesiredRwows AS (
        SELECT
           CustomerId
           ,FileName
           ,Property
           ,Value
        FROM
           cteDefineRowPrecedence t
        WHERE
           t.RowNum = 1
           AND t.Value IS NOT NULL
    )
    
    SELECT *
    FROM
        cteDesiredRwows t
        PIVOT (
           MAX(Value)
           FOR Property IN (Size,[Format],[State],Comment)
        ) p
    ORDER BY
        CustomerId
        ,FileName
    

    And here is a nested query version that will make it easier to embed/put in your dynamic sql....

    SELECT *
    FROM
        (
           SELECT CustomerId, FileName, Property, Value
           FROM
              (SELECT *
                 ,ROW_NUMBER() OVER (PARTITION BY CustomerId, FileName, Property ORDER BY
                    CASE WHEN Value IS NOT NULL THEN 0 ELSE 1 END
                    ,UpdatedDtm DESC) as RowNum
              FROM
                 @Table) r
           WHERE
              r.RowNum = 1
              AND r.Value IS NOT NULL
        ) t
        PIVOT (
           MAX(Value)
           FOR Property IN (Size,[Format],[State],Comment)
        ) p
    ORDER BY
        CustomerId
        ,FileName