
Generate deterministic unique fixed length filename string from multiple input strings


I have multiple strings from which I would like to generate a single, fixed-length, deterministic string. I am trying to ensure uniqueness in a database and will also be using the string for filenames, so I need to avoid collisions as far as possible and avoid special characters. It also needs to be deterministic, so that the same three strings in the same order always produce the same output string.

I thought of concatenating the strings on a known delimiter, and base64 encoding. However that is not fixed length.

I thought of concatenating the strings, hashing the result, and base64 encoding the hash. However, the default base64 alphabet includes characters such as / that Windows will complain about in filenames, and this seems like bad practice.

Now I'm doing this, which also feels ugly:

protected UUID parseUUID() {
    try {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        List<String> strings = new ArrayList<>();
        strings.add(stringOne);
        strings.add(stringTwo);
        strings.add(stringThree);

        strings.removeIf(str -> str == null || str.isEmpty());
        for(int i = 0; i < strings.size(); i++) {
            String string = strings.get(i);
            string = string.replace("|", "\\|");
            strings.set(i, string);
        }
        String input = String.join("|", strings);
        byte[] hash = digest.digest(input.getBytes());

        return UUID.nameUUIDFromBytes(hash);
    } catch(NoSuchAlgorithmException e) {
        return null;
    }
}

What are the odds of collision with this method? What is the best way to generate a deterministic fixed length string suitable for a filename from multiple input strings? Surely this is not it.
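
As a rough way to reason about the odds, the birthday bound can be sketched as below. This is an approximation, not an exact figure: it assumes the hash output behaves like uniformly random bits, and it uses the fact that a name-based (type 3) UUID carries 122 free bits, since 6 of its 128 bits are fixed for version and variant. The class name, method name, and the figure of one billion inputs are hypothetical, chosen only for illustration.

```java
class CollisionEstimate {
    /**
     * Birthday-bound approximation: p ~ n^2 / 2^(bits + 1).
     * Only meaningful while the result is much smaller than 1.
     */
    static double collisionProbability(double items, int bits) {
        return items * items / Math.pow(2.0, bits + 1);
    }

    public static void main(String[] args) {
        double n = 1e9; // hypothetical number of distinct inputs
        System.out.println("UUID v3 (122 bits): " + collisionProbability(n, 122));
        System.out.println("SHA-256 (256 bits): " + collisionProbability(n, 256));
        System.out.println("SHA-512 (512 bits): " + collisionProbability(n, 512));
    }
}
```

Under these assumptions the 122-bit case comes out on the order of 1e-19 for a billion inputs, so accidental collisions are unlikely either way. The estimate says nothing about deliberately crafted collisions, which are a known weakness of MD5.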


Solution

  • The solution I came up with for now is:

    protected String parseHash() {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-512");
            List<String> strings = new ArrayList<>();
            strings.add("one");
            strings.add("two");
            strings.add("three");
    
            strings.removeIf(str -> str == null || str.isEmpty());
            for(int i = 0; i < strings.size(); i++) {
                String string = strings.get(i);
                string = string.replace("|", "\\|");
                strings.set(i, string);
            }
            String input = String.join("|", strings);
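            // Fix the charset: getBytes() without an argument uses the platform default,
            // so the same strings could hash differently on another machine.
            byte[] hash = digest.digest(input.getBytes(StandardCharsets.UTF_8));
            // javax.xml.bind.DatatypeConverter: JAXB was removed from the JDK in Java 11,
            // so newer runtimes need it as a separate dependency.
            return DatatypeConverter.printHexBinary(hash);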
        } catch(NoSuchAlgorithmException e) {
            return null;
        }
    }
    

    As I've read, UUID.nameUUIDFromBytes(hash) computes an MD5 digest of the bytes it is given, so the SHA-256 hash is squeezed into a 128-bit UUID, which reduces its resolution. Using the raw hex of the hash seems to be the most elegant way I can think of, but I am of course open to other answers.
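
One possible variation, offered as a sketch rather than a definitive answer: instead of escaping and joining on "|", each part can be fed to the digest with a length prefix, which keeps the boundaries between the input strings unambiguous. The class name DeterministicName and the method of are hypothetical. Note two deliberate differences from the code above: null is treated as an empty string instead of being removed, so the position of each part still matters, and SHA-256 is used to get a 64-character name (SHA-512 would work the same way and give 128 characters).

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class DeterministicName {
    /** Hashes the parts in order and returns 64 lowercase hex characters. */
    static String of(String... parts) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            for (String part : parts) {
                // Assumption: null is treated like "" rather than dropped.
                byte[] bytes = (part == null ? "" : part).getBytes(StandardCharsets.UTF_8);
                // A 4-byte length prefix keeps ("ab", "c") distinct from ("a", "bc").
                digest.update(ByteBuffer.allocate(4).putInt(bytes.length).array());
                digest.update(bytes);
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(of("one", "two", "three"));
    }
}
```

Lowercase hex uses only the characters 0-9 and a-f, so the result is safe as a filename and is unaffected by case-insensitive file systems, which would not be true of a base64 variant.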