How can I process a database with unreliable passwords

I found a website haveibeenpwned that makes it possible to check if the password is compromised. A database with passwords is also provided. The database looks like a text file

000000005AD76BD555C1D6D771DE417A4B87E4B4:10
00000000A8DAE4228F821FB418F59826079BF368:4
00000000DD7F2A1C68A35673713783CA390C9E93:873

There are several hundred million values encoded in SHA1, before that the values were converted to uppercase.

I found this:

  public static String generateSHA1(String message) throws RuntimeException {
        return hashString(message, "SHA-1");
    }

    private static String hashString(String message, String algorithm)
            throws RuntimeException {
        try {
            MessageDigest digest = MessageDigest.getInstance(algorithm);
            byte[] hashedBytes = digest.digest(message.getBytes("UTF-8"));

            return convertByteArrayToHexString(hashedBytes);
        } catch (NoSuchAlgorithmException | UnsupportedEncodingException ex) {
            throw new RuntimeException(
                    "Could not generate hash from String", ex);
        }
    }

    private static String convertByteArrayToHexString(byte[] arrayBytes) {
        StringBuilder stringBuffer = new StringBuilder();
        for (byte arrayByte : arrayBytes) {
            stringBuffer.append(Integer.toString((arrayByte & 0xff) + 0x100, 16)
                    .substring(1));
        }
        return stringBuffer.toString();
    }

Should I bring my password (which should be compared with the hashes table) after converting to hash, to a HEX string ?

Is this enough to be able to perform a comparison?

However, I don't understand how to check such a volume :

After the ':' character, what is this number indicated ?
To verify, do I have to translate the password value to uppercase and encode using Message Digest with the SHA1 type ?
However, if I get a hash, then I will have to perform a hash comparison search across the entire database.
And is it possible to somehow allocate the prefix of the hash string and make a selection of suitable values and search among their range using Java, which will reduce the load on the database and reduce the response of the application?

If anyone has any ideas how to do this and what is this hash in the file, please share?

Solution

As per the blog post that explains all this, the :5 is telling you that this password showed up with separate account information 5 separate times. That one with 873 suggests it's a somewhat commonly used password and in that sense 'worse'.

I don't think the passwords are case insensitive. The API that HIBP provides has case insensitive hash searches, i.e. if you search for 000000005ad76BD555C1D6D771DE417A4B87E4B4, it finds a hit just as well as 000000005AD76BD555C1D6D771DE417A4B87E4B4. "Case insensitive" and 'hash it' are not compatible. For a very quick test, check if SHA1(iloveyou) and SHA1(ILOVEYOU) are in the database. if both are, you have your confirmation: Don't mess with cases. If only one is, uppercase (or lowercase) before searching.

Verifying a password's suitability is done as follows:

Take the user's password. Don't mess with its case. SHA1 it. Check if that SHA1 is in this file (the part before the colon). If yes, tell the user to go pick something else and link em to the usual. Especially if the number following the colon is high, that's worse, but appearing at all in this list is grounds to deny that password. That :1234 stuff is mostly for other kinds of research.

However, if I get a cache, then I will have to perform a hash comparison search across the entire database.

Yes; how else did you think this was going to work? Instead of the 6GB or whatever it is file, you can use their API, even cross-site (each IP is rate limited to one request per 1.5 seconds). Yes, your change pass/create new account form can directly ping HIBP with absolutely no involvement from your server, if you want. The HIBP site's API docs explain all this and how to do that. Then they¹ will do a search.

For what its worth, searching all that isn't too difficult assuming you do it correctly, that is, indexed. For example, a binary search (I'm pretty sure that file is 'sorted' on SHA1, so look up hashes the way you look up a phone number in a phone book based on somebody's name: Don't just open the book and start reading from Mr. Aabarth, go to the exact middle, check if the name you find there is 'below' or 'above' what you are looking for. If 'below', then take the bottom half of the phonebook, tear it off, toss it in the garbage. You now have half a phonebook to search.

Continue to apply that algorithm: Each time you toss half of what is left in the garbage. That's a LOG2(n) algorithm and therefore will e.g. process in something like 32 steps total ('get rid of half the phonebook' applied 32 times will get you down to a single entry even for a phonebook that has 4 billion entries in it. LOG2(n) is pretty neat).

dbs do this automatically (or can do even better than that) assuming you add an index to the column.

If you actually write this yourself it is smart to build up an index - say, 100k hashes and their exact position in the giant file. To then look up a hash, check your index for that hash that is 'below' your hash but as close to it as possible (you can use TreeSet in java for this for example). Open the file, seek to that position, and then start processing until you either [A] find that hash (and deny that password as unsuitable), or [B] you find a hash that is 'higher' (sorts above the hash you have), in which case the password is presumably allright to use. Given that there are something like 4 billion passwords in there, you've now reduced the search to scanning through the file for only 4 billion / 100k = 40,000 hashes, which is going to be near instant, and would be faster than doing a full binary search on the entire file (as hopping around disk sectors is usually pricier even with SSDs), and avoids the difficulty of first seeking to a newline to avoid trying to scan for stuff in the middle of a hash.

[1] they is referring to Troy Hunt who basically runs HIBP as a charity. Consider donating to the cause if you use HIBP!