Search code examples
sql-serverdatabaset-sqluuidulid

ULID datatype in SQL Server database


While searching for UUIDs I stumbled upon ULIDs https://github.com/ulid/spec

And I want to use them in my SQL Server database as primary key for my users (instead of basic incremental value).

So I wanted to ask:

  1. What datatype is the best suited? uniqueidentifier? binary16? something else? and why?

  2. Are there some other extra steps to fully benefit from ULID structure? especially sorting/searching? in comparison to other UUIDs.


Solution

  • You are after a 16-byte, universally unique, chronologically sortable unique identifier.

    SQL Server already has that: uniqueidentifier.

    Many people are under the mistaken impression that a uniqueidentifer is a random value, and that its use as a clusting key is bad because it causes I/O to be scattered all over the place, torn pages, and poor temporaral locality.

    That is only true for "random" uuids - known as a "type 4" uuids.

    Sequential sortable UUIDs

    But there are other types of UUIDs:

    • Type 1: stuffs MAC address+datetime into 128 bits
    • Type 3: stuffs an MD5 hash into 128 bits
    • Type 4: stuffs random data into 128 bits
    • Type 5: stuffs an SHA1 hash into 128 bits
    • Type 6: sequential sortable UUIDs

    Rather than being built from random data, or an MD5 hash, or an SHA1 hash, we can use a hash based on a timestamp , with a resolution of 100 ns (i.e. 0.0001 ms).

    And similarly to how you can default a uniqueidentifier colunm to a "type 4 random uuid":

    CustomerUUID uniqueidentifier DEFAULT newid()
    

    SQL Server also supports sequential type-1 UUIDs:

    CustomerUUID uniqueidentifier DEFAULT newsequentialid()
    

    And those uuids sort chronologically - as long as they're created on the same machine

    NodeID, MAC address, and Multiple Machines

    There is the issue that the type-1 UUID contains a 6-byte nodeID, which is the client's MAC address.

    00112233-4455-6677-8899-AABBCCDDEEFF

    Timestamp-High Timestamp-Mid Timestamp-Low Sequence NodeID
    00112233 4455 6677 8899 AABBCCDDEEFF

    And SQL Server, for whatever reason, decided to sort UUIDs by the nodeID first.

    SQL Server sorts guids first by f, then e, then d, c, b, a, 9, ..., 2, 1, 0:

    {00112233-4455-6677-9988-ffeeddccbbaa}
    

    In other words, it sorts by:

    • Node[0]
    • Node[1]
    • Node[2]
    • Node[3]
    • Node[4]
    • Node[5]
    • UInt16 Sequence (little endian)
    • UInt16 TimestampLow (big endian)
    • UInt16 TimestampMid (big endian)
    • UInt32 TimestampHigh (big endian)

    Which means:

    • if all the UUIDs are generated by the same machine (whether it be the database server itself, or a web web-server) then the UUIDs will be sequential based on the timestamp
    • but if the UUIDs are generated on different clients, with different nodeIDs (i.e. MAC address) then you'll have them sorted first by node, and then by timestamp

    That may or may not be an issue for you. But if it is, the solution is to shuffle around the bytes of the uniqueidentifier/UUID so that:

    • if SQL Server is gonna sort the 6-byte NodeID first
    • then i'll just put the timestamp there
    • and move the nodeID to where the timestamp was

    In C# we use the following class to generate an "SQL Server sortable UUID". Other databases may sort UUIDs/uniqueidentifiers differently.

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Threading;
    using System.Web;
    
    /// <summary>
    /// Summary description for SortableGuid
    /// </summary>
    public class SortableGuid
    {
        //UUID timestamps are 100ns ticks starting from October 15, 1582 (date we switched to Gregorian calendar)
        //Windows FILETIME is 100ns ticks starting from January 1, 1601  (the date of the start of the first 400-year cycle of the Gregorian calendar)
        private static readonly Int64 guidEpochOffset = -5748192000000000; //6653 days --> 159672 hours --> 9580320 minutes --> 574819200 seconds -> 574819200000 ms -> 574819200000000 us --> 5748192000000000 ticks
    
        private static Int64 _lastTick = DateTime.UtcNow.Ticks + guidEpochOffset;
    
        /// <summary>
        /// 
        /// </summary>
        /// <param name="location">The destination, whose value is compared with comparand and possibly replaced.</param>
        /// <param name="comparison">The value that is compared to the value at location1.</param>
        /// <param name="newValue">The value that replaces the destination value if the comparison results in equality.</param>
        /// <returns>true if comparand was greater than location, and location was updated to newValue. 
        /// Otherwise false.</returns>
        public static Boolean InterlockedExchangeIfGreaterThan(ref Int64 location, Int64 newValue, Int64 comparand)
        {
            //Thank you Raymond Chen: https://stackoverflow.com/a/13056904/12597
            Int64 currentValue;
            do
            {
                currentValue = Interlocked.Read(ref location); //a read of a 64-bit location is not atomic on x86. If location was 32-bit you could get away with just "currentValue = location;"
                if (currentValue >= comparand) return false;
            }
            while (System.Threading.Interlocked.CompareExchange(ref location, newValue, currentValue) != currentValue);
            return true;
        }
    
        /// <summary>
        /// Returns a new sortable guid. These guid's are suitable as clustering keys in SQL Server.
        /// </summary>
        /// <returns></returns>
        public static Guid NewGuid()
        {
            /*
                Blatently borrowed from Entity Framework.
                    https://github.com/dotnet/efcore/blob/master/src/EFCore/ValueGeneration/SequentialGuidValueGenerator.cs
    
    
                Except with two differences:
                    - they start with an initial timetime, generated statically once - and keep increasing that. 
                        That causes the "timestamp" to drift further and further from reality. 
                        We generate a timestamp each time, and only rely on incrementing if we're not greater than the last timestamp. 
                        A CPU is capable of generating ~200k GUIDs per 100ns tick - so it's not like we can ignore the duplciate ticks problem.
                    - UUID timestamp ticks start at October 15, 1582 (date of gregorian calendar). 
                        Windows, and DateTime.Ticks returns number of ticks since January 1, 1601 (the date of the first 400 year Gregorian cycle). 
                        We'll offset the timestamp to match the UUID spec.
                    - We place the version number type-7: Sortable by SQL Server with a timestamp.
    
                SQL Server sorts first by the NodeID (i.e. the final 6-bytes):
    
                16-byte guid                                            Microsoft clsid string representation
                ===========--=====--=====--=====--=================     ======================================
                33 22 11 00  55 44  77 66  99 88  ff ee dd cc bb aa ==> {00112233-4455-6677-9988-ffeeddccbbaa}
                \_______________________/  \___/  \_______________/
                Timestamp and Version  ^    Clk        Node ID
    
                The goal is to create a sequential uuid (e.g. UuidCreateSequential), but without having to rely on a MAC address. 
                (i.e. Does an Azure web-site even **have** a MAC address? 
                We certainly can't P/Invoke out to UuidCreateSequental when we're not running on Windows)
    
                So we conceptually move the 8-byte timestamp to it's new location in NodeID+ClockSequence
    
                And what used to be Timestamp+Version becomes random.
    
                And, like type-4 Uuids being called Type-4 to help reduce the chances of collisions between types,
                we call this new version type-7 (sortable by SQL Server with a timestamp).
            */
    
            //Start from a new random (type 4) uuid. 
            Byte[] guidBytes = Guid.NewGuid().ToByteArray();
    
            //Generate 8-byte timestamp
            Int64 currentTicks = DateTime.UtcNow.Ticks + guidEpochOffset;
            //Make sure that currentTicks is greater than the last tick count (in case they're generating guids faster than 100ns apart)
            Int64 last;
            do
            {
                last = Interlocked.Read(ref _lastTick); //a read of a 64-bit value isn't atomic; so it has to be interlocked.
                if (currentTicks <= last)
                    currentTicks = last + 1;
            } while (Interlocked.CompareExchange(ref _lastTick, currentTicks, last) != last);
    
    
            Byte[] counterBytes = BitConverter.GetBytes(currentTicks);
    
            if (!BitConverter.IsLittleEndian)
            {
                Array.Reverse(counterBytes);
            }
    
            //This Guid started it's life as a Type 4 (Random) guid, but the final 8 bytes were changed to contain a sequential counter.
            //I want to tag this as a different kind of Uuid algorithm. In Delphi we already called this a Type 7 uuid.
            guidBytes[07] = (Byte)(0x70 | (guidBytes[07] & 0x0f));
    
            //Put the timestamp in place - in the proper SQL Server sorting order.
            guidBytes[08] = counterBytes[1];
            guidBytes[09] = counterBytes[0];
            guidBytes[10] = counterBytes[7];
            guidBytes[11] = counterBytes[6];
            guidBytes[12] = counterBytes[5];
            guidBytes[13] = counterBytes[4];
            guidBytes[14] = counterBytes[3];
            guidBytes[15] = counterBytes[2];
    
            Guid result = new Guid(guidBytes);
            return result;
        }
    }