Search code examples
sql-serverazure-sql-databasecolumnstore

Clustered column store not aligning GUIDs to segments despite ordering


I have a data set that I want to place in a clustered column store and optimize it for segment elimination when accessing it by a column called SubjectId which is defined as uniqueidentifier type.

I'm using an Azure SQL database with a compatibility level of 16. I have confirmed this same behavior in SQL Server Developer Edition 2022. According to the documentation, this version should support segment elimination based on a uniqueidentifier column.

Following various best practices, I first take my data set and create a rowstore clustered index ordered by that SubjectId:

CREATE CLUSTERED INDEX [MyData_CCI] 
ON [dbo].[MyData_CCS] (SubjectId)
WITH (MAXDOP = 1);

I then create the clustered document store using the DROP EXISTING option:

CREATE CLUSTERED COLUMNSTORE INDEX [MyData_CCI]
ON [dbo].[MyData_CCS]
WITH (DROP_EXISTING = ON, MAXDOP = 1);

The resulting segments don't appear to be aligned whatsoever to this SubjectId:

enter image description here

And when I try and query the data using a single SubjectId in the where clause, the statistics show zero segment elimination:

(44 rows affected)
Table 'MyData_CCS'. Scan count 1, logical reads 0, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 2769, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0. Table 'MyData_CCS'. Segment reads 77, segment skipped 0.

What am I doing wrong here? Isn't SQL Server 2022+ (and therefore SQL Azure?) supposed to support uniqueidentifier for segment elimination and predicate push down? As I said, this also happens with SQL Dev Edition 2022, so I don't think it's an Azure thing.

You can repro this yourself with the following schema/data generation script:

IF  EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[MyData]') AND type in (N'U'))
DROP TABLE [dbo].[MyData]
GO
CREATE TABLE [dbo].[MyData](
    [SubjectId] [uniqueidentifier] NOT NULL,
    [SomeDateTime] [datetime] NOT NULL,
    [SomeInteger] [int] NOT NULL,
    [SomeString] [nvarchar](max) NOT NULL
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
GO

-- Create a temporary table to store unique SubjectId values
DROP TABLE #SubjectIds
CREATE TABLE #SubjectIds (
    SubjectId UNIQUEIDENTIFIER
);
GO

WITH NumberSeries AS (
    SELECT 1 AS Number
    UNION ALL
    SELECT Number + 1
    FROM NumberSeries
    WHERE Number < 42000
)
INSERT INTO #SubjectIds (SubjectId)
SELECT NEWID()
FROM NumberSeries
OPTION (MAXRECURSION 0);

-- Insert 21 million rows using cross join
INSERT INTO MyData (SubjectId, SomeDateTime, SomeInteger, SomeString)
SELECT 
    s.SubjectId,
    DATEADD(SECOND, ABS(CHECKSUM(NEWID()) % 31536000), '2020-01-01'), -- Random date within a year
    ABS(CHECKSUM(NEWID()) % 1000000), -- Random integer
    REPLICATE(N'A', 100) -- Example string, adjust as needed
FROM 
    #SubjectIds s
    CROSS JOIN (SELECT TOP (500) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n FROM master..spt_values) AS x;

CREATE CLUSTERED INDEX [MyData_CCI] 
ON [dbo].[MyData] (SubjectId)
WITH (MAXDOP = 1, DROP_EXISTING = ON);
GO

CREATE CLUSTERED COLUMNSTORE INDEX [MyData_CCI]
ON [dbo].[MyData]
WITH (DROP_EXISTING = ON, MAXDOP = 1);

--Show Segment data
select s.Name as SchemaName, 
       t.Name as TableName,
       i.Name as IndexName,
       c.name as ColumnName,
       c.column_id as ColumnId,
       cs.segment_id as SegmentId,
       cs.min_data_id as MinValue,
       cs.max_data_id as MaxValue
  from sys.schemas s
  join sys.tables t
    on t.schema_id = s.schema_id
  join sys.partitions as p  
    on p.object_id = t.object_id   
  join sys.indexes as I
    on i.object_id = p.object_id
   and i.index_id = p.index_id
  join sys.index_columns as ic
    on ic.[object_id] = I.[object_id]
   and ic.index_id = I.index_id   
  join sys.columns c
    on c.object_id = t.object_id
   and c.column_id = ic.column_id
  join sys.column_store_segments cs
    on cs.hobt_id = p.hobt_id
   and cs.column_id = ic.index_column_id 
WHERE t.Name = 'MyData' AND c.Name = 'SubjectId'
ORDER BY cs.segment_id

Update/Resolution

This was resolved via discovering that the min/max id meta data is only for numeric types. For GUIDs, strings, etc. post SQL 2022, you need to look at the min_deep_data and max_deep_data columns. In my case, these were not populated despite having created these column stores brand new on a level 16 compatibility DB in SQL Azure. I do not know why that's the case, but it is.

I was forced to do a REBUILD with MAXDOP 1 on the affected column stores to populate this meta data, after which I saw more appropriate segment elimination. Specifically:

ALTER INDEX [MyData_CCI] ON [dbo].[MyData] REBUILD WITH (MAXDOP = 1);

Solution

  • I dunno, this worked fine for me.

    IF  EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[MyData]') AND type in (N'U'))
    DROP TABLE [dbo].[MyData]
    GO
    CREATE TABLE [dbo].[MyData](
        [SubjectId] [uniqueidentifier] NOT NULL,
        [SomeDateTime] [datetime] NOT NULL,
        [SomeInteger] [int] NOT NULL,
        [SomeString] [nvarchar](max) NOT NULL
    ) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
    GO
    
    -- Create a temporary table to store unique SubjectId values
    DROP TABLE #SubjectIds
    CREATE TABLE #SubjectIds (
        SubjectId UNIQUEIDENTIFIER
    );
    GO
    
    WITH NumberSeries AS (
        SELECT 1 AS Number
        UNION ALL
        SELECT Number + 1
        FROM NumberSeries
        WHERE Number < 42000
    )
    INSERT INTO #SubjectIds (SubjectId)
    SELECT NEWID()
    FROM NumberSeries
    OPTION (MAXRECURSION 0);
    
    -- Insert 21 million rows using cross join
    INSERT INTO MyData (SubjectId, SomeDateTime, SomeInteger, SomeString)
    SELECT 
        s.SubjectId,
        DATEADD(SECOND, ABS(CHECKSUM(NEWID()) % 31536000), '2020-01-01'), -- Random date within a year
        ABS(CHECKSUM(NEWID()) % 1000000), -- Random integer
        REPLICATE(N'A', 100) -- Example string, adjust as needed
    FROM 
        #SubjectIds s
        CROSS JOIN (SELECT TOP (500) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n FROM master..spt_values) AS x;
    
    CREATE CLUSTERED INDEX [MyData_CCI] 
    ON [dbo].[MyData] (SubjectId)
    WITH (MAXDOP = 1);
    GO
    
    CREATE CLUSTERED COLUMNSTORE INDEX [MyData_CCI]
    ON [dbo].[MyData]
    WITH (DROP_EXISTING = ON, MAXDOP = 1);
    
    set statistics io on
    
    select *
    from Mydata
    where SubjectId = '15A4815B-DDD3-48BB-84DE-A578663D8D91'
    
    
    /*
    Table 'MyData'. Scan count 4, logical reads 0, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 51224, lob physical reads 9, lob page server reads 0, lob read-ahead reads 80541, lob page server read-ahead reads 0.
    Table 'MyData'. Segment reads 10, segment skipped 11.
    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
    
    */
    
    SELECT @@version
    
    Microsoft SQL Server 2022 (RTM) - 16.0.1000.6 (X64) 
    

    Maybe the segment thingy depends on which GUIDs you are selecting?

    Edit, had to do a new batch:

    select *
    from Mydata
    where SubjectId = 'B52D5BC0-33B6-44F9-87C3-1CAFD7FDDEA5'
    

    Output:

    SchemaName TableName IndexName ColumnName ColumnId SegmentId MinValue MaxValue min_deep_data max_deep_data encoding_type primary_dictionary_id secondary_dictionary_id row_count
    dbo MyData MyData_CCI SubjectId 1 0 26 24899 100031AAB70512CE9244B9092BFA4B2EF4B1 100035979016BB401242B0796DC167CB1F9E 3 0 1 1048576
    dbo MyData MyData_CCI SubjectId 1 1 27 25837 100035979016BB401242B0796DC167CB1F9E 1000203A7982506BAF45A5CA963EBE4637AF 3 0 1 1048576
    dbo MyData MyData_CCI SubjectId 1 2 24 26749 1000203A7982506BAF45A5CA963EBE4637AF 10004FD8C586BF0F5D4BB109D296FDD032B3 3 0 1 1048576
    dbo MyData MyData_CCI SubjectId 1 3 25 27685 10004FD8C586BF0F5D4BB109D296FDD032B3 1000EBD81ADB9A00F94DA589FE85EC39C5C3 3 0 1 1048576
    dbo MyData MyData_CCI SubjectId 1 4 22 28601 1000B8640879660E6541A25E04E74363E39C 1000F8CC2A1D96B30B4C8687FF1DAC082C2C 3 0 1 1048576
    dbo MyData MyData_CCI SubjectId 1 5 19 29482 10003F8ACDB56E2CBB4E9C7935F65E78D3A9 1000CD9C58D6438CB845A3CE79EC437123E9 3 0 1 1048576
    dbo MyData MyData_CCI SubjectId 1 6 20 30368 1000CD9C58D6438CB845A3CE79EC437123E9 1000006AB21D01564048B767D68755ED8FFF 3 0 1 1048576
    dbo MyData MyData_CCI SubjectId 1 7 21 31299 1000006AB21D01564048B767D68755ED8FFF 1000F151931B7AA63844A40FE37382B66DA4 3 0 1 1048576
    dbo MyData MyData_CCI SubjectId 1 8 17 32227 10002998463928C0034F8DC7067F49C9C803 1000E8AF680F3BA2D340A76AFFFE92C5D579 3 0 1 1048576
    dbo MyData MyData_CCI SubjectId 1 9 18 33119 1000C14ABF7C819F784E952E3BD91E0D4059 10008733320FD434364890D9946020C5621E 3 0 1 1048576
    dbo MyData MyData_CCI SubjectId 1 10 15 33983 10008733320FD434364890D9946020C5621E 10000A7AFD6104DE574C9561F4634A3D7AEE 3 0 1 1048576
    dbo MyData MyData_CCI SubjectId 1 11 13 34899 1000C6A9B411DD37C941A17E00057ABE0045 1000750EC694DA9E964CBC4CF5956C3450BE 3 0 1 1048576
    dbo MyData MyData_CCI SubjectId 1 12 14 35785 10000F82C113010FAB4699210D5873CEF401 1000A2093746B569CE43A5DD4E611EC2232B 3 0 1 1048576
    dbo MyData MyData_CCI SubjectId 1 13 11 36677 1000A2093746B569CE43A5DD4E611EC2232B 1000ECAD306ED4688C4686719B0832A5C345 3 0 1 1048576
    dbo MyData MyData_CCI SubjectId 1 14 12 37569 1000ECAD306ED4688C4686719B0832A5C345 1000D8255C9755EB7E41BD5AA804D4D9DF14 3 0 1 1048576
    dbo MyData MyData_CCI SubjectId 1 15 9 38474 1000A155D38E3D1A8540884B1115D3C54DD2 100012931F70BB180A4EBD61C7D33663A9F6 3 0 1 1048576
    dbo MyData MyData_CCI SubjectId 1 16 7 39324 1000EC8501B92CE62847875B143C5C24915B 10006FC55B8556D70E4D84D85BC913E5F067 3 0 1 1048576
    dbo MyData MyData_CCI SubjectId 1 17 8 40220 10006FC55B8556D70E4D84D85BC913E5F067 10001528C2B94ED06E40B365ADB168364FC4 3 0 1 1048576
    dbo MyData MyData_CCI SubjectId 1 18 5 41104 10001528C2B94ED06E40B365ADB168364FC4 1000A4772319590F5E48BCBCBA2771741A25 3 0 1 1048576
    dbo MyData MyData_CCI SubjectId 1 19 4 41981 10005578600DA8DCA34A801A16570EDAEBB5 100012931F70BB180A4EBD61C7D33663A9F6 3 0 1 1048576
    dbo MyData MyData_CCI SubjectId 1 20 532 42006 1000DDEAACB5C5E57D4C9B131CAF92E77395 1000AFF3D2B670ACBB46B04E1CEB27BFFAD0 3 0 1 28480
    Table 'MyData'. Scan count 4, logical reads 0, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 10751, lob physical reads 0, lob page server reads 0, lob read-ahead reads 10, lob page server read-ahead reads 0.
    Table 'MyData'. Segment reads 8, segment skipped 13.
    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.