Search code examples
javauuid

How to convert a UUID to a number with less than 40 digits?


I need to convert a UUID (like ffffffff-ffff-ffff-ffff-ffffffffffff) to a number with as least digits as possible. These UUIDs are being generated by com.fasterxml.uuid.impl.TimeBasedGenerator.java.

package com.fasterxml.uuid.impl;

import java.util.UUID;

import com.fasterxml.uuid.*;

/**
 * Implementation of UUID generator that uses time/location based generation
 * method (variant 1).
 *<p>
 * As all JUG provided implementations, this generator is fully thread-safe.
 * Additionally it can also be made externally synchronized with other
 * instances (even ones running on other JVMs); to do this,
 * use {@link com.fasterxml.uuid.ext.FileBasedTimestampSynchronizer}
 * (or equivalent).
 *
 * @since 3.0
 */
public class TimeBasedGenerator extends NoArgGenerator
{
    /*
    /**********************************************************************
    /* Configuration
    /**********************************************************************
     */

    protected final EthernetAddress _ethernetAddress;

    /**
     * Object used for synchronizing access to timestamps, to guarantee
     * that timestamps produced by this generator are unique and monotonically increasings.
     * Some implementations offer even stronger guarantees, for example that
     * same guarantee holds between instances running on different JVMs (or
     * with native code).
     */
    protected final UUIDTimer _timer;

    /**
     * Base values for the second long (last 8 bytes) of UUID to construct
     */
    protected final long _uuidL2;
    
    /*
    /**********************************************************************
    /* Construction
    /**********************************************************************
     */

    /**
     * @param ethAddr Hardware address (802.1) to use for generating
     *   spatially unique part of UUID. If system has more than one NIC,
     */
    
    public TimeBasedGenerator(EthernetAddress ethAddr, UUIDTimer timer)
    {
        byte[] uuidBytes = new byte[16];
        if (ethAddr == null) {
            ethAddr = EthernetAddress.constructMulticastAddress();
        }
        // initialize baseline with MAC address info
        _ethernetAddress = ethAddr;
        _ethernetAddress.toByteArray(uuidBytes, 10);
        // and add clock sequence
        int clockSeq = timer.getClockSequence();
        uuidBytes[UUIDUtil.BYTE_OFFSET_CLOCK_SEQUENCE] = (byte) (clockSeq >> 8);
        uuidBytes[UUIDUtil.BYTE_OFFSET_CLOCK_SEQUENCE+1] = (byte) clockSeq;
        long l2 = UUIDUtil.gatherLong(uuidBytes, 8);
        _uuidL2 = UUIDUtil.initUUIDSecondLong(l2);
        _timer = timer;
    }
    
    /*
    /**********************************************************************
    /* Access to config
    /**********************************************************************
     */

    @Override
    public UUIDType getType() { return UUIDType.TIME_BASED; }

    public EthernetAddress getEthernetAddress() { return _ethernetAddress; }
    
    /*
    /**********************************************************************
    /* UUID generation
    /**********************************************************************
     */
    
    /* As timer is not synchronized (nor _uuidBytes), need to sync; but most
     * importantly, synchronize on timer which may also be shared between
     * multiple instances
     */
    @Override
    public UUID generate()
    {
        final long rawTimestamp = _timer.getTimestamp();
        // Time field components are kind of shuffled, need to slice:
        int clockHi = (int) (rawTimestamp >>> 32);
        int clockLo = (int) rawTimestamp;
        // and dice
        int midhi = (clockHi << 16) | (clockHi >>> 16);
        // need to squeeze in type (4 MSBs in byte 6, clock hi)
        midhi &= ~0xF000; // remove high nibble of 6th byte
        midhi |= 0x1000; // type 1
        long midhiL = (long) midhi;
        midhiL = ((midhiL << 32) >>> 32); // to get rid of sign extension
        // and reconstruct
        long l1 = (((long) clockLo) << 32) | midhiL;
        // last detail: must force 2 MSB to be '10'
        return new UUID(l1, _uuidL2);
    }
}

I expect that every day at least 12 million UUIDs will be generated on a particular machine. I assume that TimeBasedGenerator will be instatiated once when the program starts and thus will have a constant network address and a constant random generator seed time (value now in the code snipped below).

final long now = 1644575806478L;

final TimeBasedGenerator sut = new TimeBasedGenerator(EthernetAddress.fromInterface(),
  new UUIDTimer(new Random(now), null));

How can I turn a UUID into a number with

a) less than 40 digits and

b) so that duplicate UUIDs occur once in 6 months the earliest?

Failed attempt 1

The simplest way to convert a UUID to a number looks like this:

import java.math.BigInteger;
import java.util.function.Function;

public class OldUUIDConverter implements Function<String, BigInteger> {
    @Override
    public BigInteger apply(final String uuid) {
        final String uuidWithoutDashes = uuid.replaceAll("-", "");
        return new BigInteger(uuidWithoutDashes, 16);
    }
}

For the largest possible UUID (ffffffff-ffff-ffff-ffff-ffffffffffff) it producees the number

340282366920938463463374607431768211455
^         ^         ^         ^         ^

Because we convert the UUID directly, it won't be repeated if the UUID generator does not produce duplicate UUIDs too frequently.

The criterion b) of the above requirement is satisfied. But criterion a) -- the length of the converted number -- is not.

Failed attempt 2

Someone proposed to only take the first 8 digits of the UUID. Thus the criterion a) will be satisfied.

However, the first 8 digits of the UUID repeat too frequently.

I wrote a simple test program:

import com.fasterxml.uuid.EthernetAddress;
import com.fasterxml.uuid.UUIDTimer;
import com.fasterxml.uuid.impl.TimeBasedGenerator;
import org.junit.Test;

import java.io.IOException;
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

public class UUIDShortenerTest {
    @Test
    public void test() throws IOException {
        final Set<String> first8DigitsAsLong = new HashSet<>(73000000);

        final long now = 1644575806478L;

        final TimeBasedGenerator sut = new TimeBasedGenerator(EthernetAddress.fromInterface(), new UUIDTimer(new Random(now), null));

        for (int i=0; i < Integer.MAX_VALUE; i++) {
            final String curUuid = sut.generate().toString();
            final String first8Digits = curUuid.substring(0, 8);
            if ((i % 1000000L) == 0) {
                System.out.println(String.format("Counter: %d", i));
            }
            if (!first8DigitsAsLong.add(first8Digits)) {
                System.out.println("Duplicate!");
                System.out.println(i);
                break;
            }
        }
    }
}

I ran it 10 times. Every time I wrote down the number of UUID generations after which the first 8 digits are repeated.

Here are the results:

  1. Duplicate first 8 digits found after 74499376 UUID generations.
  2. 44259478
  3. 45365652
  4. 45031094
  5. 45445463
  6. 46250299
  7. 45403800
  8. 44658612
  9. 46098250
  10. 43748051

Let's assume that the first 8 digits are repeated after I generate 75 million UUIDs.

If I generate 12 million UUIDs daily, this means that the first UUID with duplicate first 8 digits will be generated after 62-63 days (75000000/1200000=62.5 days).

These 62.5 days (around two months) are lower than the requirement of 6 months before UUID repetition.

Options I am aware of

There is a number of this I can do to fix the issue.

One would be to increase the number of first digits that I use until the frequency of UUID repetitions reaches the desired level. That is, I can try to use 9 first digits instead of 8 and see if it's good enough. If it doesn't, I can use 10 first digits etc.

Another would be to use the code from failed attempt #1 and encode the resulting number 340282366920938463463374607431768211455 in a numerical system with a greater basis.

Let's say I use numbers 0-9, Latin lowercase and uppercase letters, plus 14 special characters. This results in 10+26+26+14=76 numbers that each digit can have. I assume that the above number in a base 76 system will be shorter than the same number in decimal or hexadecimal system.

Are there any other, better options (e. g. hash functions with sufficient degree of diversity)?


Solution

  • This is the birthday problem. Theoretically, you can have a UUID collision after generating just two UUIDs, but there are convenient probability tables. You are trying to generate 12*10^6*30*6 ~= 2*10^9 ids. How much are you willing to risk?