java lambda java-8 java-stream hex

Is there a way to convert a hex string to bytes using Java streams?


In the long run, the code snippet below always ends in an OutOfMemoryError, especially when the input comes from a very large file or other bulky content.

Is there another way to write this, ideally using streams?

I saw a way to convert a byte array to a hex string here: Effective way to get hex string from a byte array using lambdas and streams

public static byte[] hexStringToBytes(String hexString) {
    if (LOGGER.isDebugEnabled()) {
        LOGGER.debug("Hex string to convert to byte[] " + hexString);
    }
    byte[] buf = new byte[hexString.length() / 2];
    String twoDigitHexToConvertToByte;
    for (int i = 0; i < buf.length; i++) {
        twoDigitHexToConvertToByte = extractPairFromStringBasedOnIndex(hexString, i);
        parseStringToBytesAndStoreInArrayOnIndex(twoDigitHexToConvertToByte, buf, i);
    }

    return buf;
}

private static void parseStringToBytesAndStoreInArrayOnIndex(String twoDigitHexToConvertToByte, byte[] buf, int i) {
    try {
        buf[i] = (byte) Integer.parseInt(twoDigitHexToConvertToByte, HEX_RADIX);
    } catch (NumberFormatException e) {
        if (LOGGER.isDebugEnabled()) {
            LOGGER.info("Tried to convert non hex string:", e);
        } else {
            LOGGER.info("Tried to convert non hex string:" + e.getMessage());
        }

        throw new HexStringToBytesException("Tried to convert non hex string"); // NOSONAR xlisjov don't want original cause since it caused exceptions.
    }
}

private static String extractPairFromStringBasedOnIndex(String hexString, int pairNumber) {
    return hexString.substring(2 * pairNumber, 2 * pairNumber + 2);
}

Solution

  • The simplest way to convert a hex string to a byte array is JDK 17’s HexFormat.parseHex(…).

    byte[] bytes = HexFormat.of().parseHex("c0ffeec0de");
    System.out.println(Arrays.toString(bytes));
    System.out.println(HexFormat.of().formatHex(bytes));
    
    [-64, -1, -18, -64, -34]
    c0ffeec0de
    

    This is the most convenient method, as it can also handle formatted input, e.g.

    byte[] bytes = HexFormat.ofDelimiter(" ").withPrefix("0x")
        .parseHex("0xc0 0xff 0xee 0xc0 0xde");
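
    Since the question asks about streams: for a hex string that is already in memory, and on a JDK older than 17, the original loop could be rewritten with an IntStream roughly as follows. This is only a sketch; the class HexStreamSketch and its methods are hypothetical names, not JDK API. It does not solve the memory problem by itself, because the whole string and the whole result array still have to fit on the heap.

```java
import java.util.stream.IntStream;

final class HexStreamSketch {

    // Hypothetical helper (not part of the JDK): converts pairs of hex digits to bytes.
    static byte[] hexToBytes(CharSequence hex) {
        if ((hex.length() & 1) != 0) {
            throw new IllegalArgumentException("odd number of hex digits: " + hex.length());
        }
        byte[] result = new byte[hex.length() / 2];
        // Each index writes only its own array slot, so the side effect in forEach is safe.
        IntStream.range(0, result.length).forEach(i ->
            result[i] = (byte) ((digit(hex, 2 * i) << 4) | digit(hex, 2 * i + 1)));
        return result;
    }

    // Character.digit returns -1 for a character that is not a digit in radix 16.
    private static int digit(CharSequence hex, int index) {
        int d = Character.digit(hex.charAt(index), 16);
        if (d < 0) {
            throw new IllegalArgumentException("not a hex digit at index " + index);
        }
        return d;
    }
}
```

    Unlike the code in the question, this avoids creating a two-character substring per byte and does not log the entire input, which are the main sources of temporary garbage there.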
    

    Note that if you have to process an entire file, even a straightforward

    String s = Files.readString(pathToYourFile);
    byte[] bytes = HexFormat.of().parseHex(s);
    

    may run with reasonable performance, as long as you have enough temporary memory. If the preconditions are met, which is the case for ASCII based charsets and hex strings, the readString method will read into an array which will become the resulting string’s backing buffer. In other words, the implicit copying between buffers, intrinsic to other approaches, is skipped.

    There’s some time spent in checking the preconditions though, which we could skip:

    String s = Files.readString(pathToYourFile, StandardCharsets.ISO_8859_1);
    byte[] bytes = HexFormat.of().parseHex(s);
    

    This enforces the same encoding used by the compact strings since JDK 9. Since hex strings consist of ASCII characters only, it will correctly interpret all sources whose charset is ASCII based¹. Only for invalid sources might the offending characters be shown incorrectly in the exception message.

    It’s hard to beat that and if using JDK 17 is an option, trying an alternative is not worth the effort. But if you are using an older JDK, you may parse a file like

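    // needs java.nio.ByteBuffer, java.nio.channels.FileChannel,
    // java.nio.channels.FileChannel.MapMode, java.nio.file.StandardOpenOption
    byte[] bytes;
    try(FileChannel fch = FileChannel.open(pathToYourFile, StandardOpenOption.READ)) {
        // map the whole file and decode it straight from the mapped buffer
        bytes = hexStringToBytes(fch.map(MapMode.READ_ONLY, 0, fch.size()));
    }
    
    public static byte[] hexStringToBytes(ByteBuffer hexBytes) {
        byte[] bytes = new byte[hexBytes.remaining() >> 1];
        // two ASCII hex digits per result byte: high nibble first, then low nibble
        // (input is not validated: Character.digit returns -1 for a non-hex byte)
        for(int i = 0; i < bytes.length; i++)
            bytes[i] = (byte)((Character.digit(hexBytes.get(), 16) << 4)
                             | Character.digit(hexBytes.get(), 16));
        return bytes;
    }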
    

    This also exploits the fact that hex strings are ASCII based, so unless the file uses a rather uncommon charset/encoding, the file data can be processed without any charset conversion. This approach also works if there’s not enough physical memory to hold the entire file, though the performance will then be lower, of course.
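
    The method above trusts its input. If malformed files are a concern, a checked variant might look like the following sketch. The class CheckedHexBuffer and the method hexStringToBytesChecked are hypothetical names introduced here, and throwing IllegalArgumentException is just one possible choice.

```java
import java.nio.ByteBuffer;

final class CheckedHexBuffer {

    // Hypothetical variant of hexStringToBytes(ByteBuffer) that rejects malformed input.
    static byte[] hexStringToBytesChecked(ByteBuffer hexBytes) {
        int length = hexBytes.remaining();
        if ((length & 1) != 0) {
            throw new IllegalArgumentException("odd number of hex digits: " + length);
        }
        byte[] bytes = new byte[length >> 1];
        for (int i = 0; i < bytes.length; i++) {
            // Java evaluates operands left to right, so the high nibble is read first.
            bytes[i] = (byte) ((digit(hexBytes) << 4) | digit(hexBytes));
        }
        return bytes;
    }

    // Reads one byte and interprets it as an ASCII hex digit.
    private static int digit(ByteBuffer hexBytes) {
        int position = hexBytes.position();
        byte b = hexBytes.get();
        // Bytes outside the ASCII range (negative values) can never be hex digits.
        int d = b >= 0 ? Character.digit((char) b, 16) : -1;
        if (d < 0) {
            throw new IllegalArgumentException("not a hex digit at position " + position);
        }
        return d;
    }
}
```

    Note that a trailing line break in the file would be rejected by this check, so the buffer's limit may have to be adjusted first if the file ends with one.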

    The file also must not be larger than 2 GiB to use a single memory mapping operation. Performing the operation in multiple memory mapping steps is possible, but you’ll soon run into the array length limit for the result, so if that’s an issue, you have to rethink the entire approach anyway.
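
    One way such a rethink could look, when the decoded data does not need to be held in a single array, is to decode in fixed-size chunks from an InputStream to an OutputStream. The following is a minimal sketch under that assumption. The class HexStreamCopy and the method decodeHex are hypothetical names, and the input is assumed to consist of ASCII hex digits only, without whitespace or line breaks.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

final class HexStreamCopy {

    // Hypothetical helper: decodes ASCII hex digits from 'in', writes the raw bytes
    // to 'out', and returns the number of bytes written. Memory use is constant.
    static long decodeHex(InputStream in, OutputStream out) throws IOException {
        byte[] hexChunk = new byte[8192];
        byte[] decoded = new byte[hexChunk.length / 2];
        long offset = 0;       // number of input bytes consumed before this chunk
        long total = 0;        // number of decoded bytes written so far
        int pendingHigh = -1;  // high nibble carried over when a read ends mid-pair
        int n;
        while ((n = in.read(hexChunk)) != -1) {
            int produced = 0;
            for (int i = 0; i < n; i++) {
                byte b = hexChunk[i];
                int d = b >= 0 ? Character.digit((char) b, 16) : -1;
                if (d < 0) {
                    throw new IOException("not a hex digit at offset " + (offset + i));
                }
                if (pendingHigh < 0) {
                    pendingHigh = d;  // first digit of a pair
                } else {
                    decoded[produced++] = (byte) ((pendingHigh << 4) | d);
                    pendingHigh = -1;
                }
            }
            out.write(decoded, 0, produced);
            total += produced;
            offset += n;
        }
        if (pendingHigh >= 0) {
            throw new IOException("odd number of hex digits");
        }
        return total;
    }
}
```

    Typical use would wrap Files.newInputStream(…) and Files.newOutputStream(…) in a try-with-resources statement, so neither the hex text nor the result has to fit into the heap.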

    ¹ So this won’t work for UTF-16 or EBCDIC, the only two counterexamples you might have to deal with in real life, though even these are very rare.