
Character digit not true when read from UTF-8 file


I'm using a Scanner to read a file line by line. What I don't understand is this: if the file is a UTF-8 file and the line currently being read starts with a digit, Character.isDigit(line.charAt(0)) returns false. If the file is not a UTF-8 file, the method returns true.

Here's some code:

File theFile = new File(pathToFile);
Scanner fileContent = new Scanner(new FileInputStream(theFile), "UTF-8");
while (fileContent.hasNextLine())
{
    String line = fileContent.nextLine();
    if (Character.isDigit(line.charAt(0)))
    {
        // We only get here when the file being read is NOT a UTF-8 file
    }
}

When I inspect the line String in the debugger, it appears to hold the same content in both cases (UTF-8 file or not): a line starting with a digit. Why is this happening?


Solution

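  • As we finally worked out in the comments, your file starts with a BOM (byte order mark, the bytes EF BB BF). A BOM is generally not recommended for UTF-8 files: Java's UTF-8 decoder does not strip it but passes it through as data, so the first line begins with the invisible character U+FEFF. line.charAt(0) is then the BOM rather than the digit, and Character.isDigit returns false. Because U+FEFF has no visible glyph, the string looks the same in the debugger.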
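
    To make the effect concrete, here is a minimal sketch that decodes a few hand-written bytes with and without a leading BOM. The class name BomDemo and the helper firstCharIsDigit are made up for illustration and are not part of the original code.

```java
import java.nio.charset.StandardCharsets;

public class BomDemo {
    // Decodes the bytes as UTF-8 and reports whether the first char is a digit.
    static boolean firstCharIsDigit(byte[] utf8Bytes) {
        String line = new String(utf8Bytes, StandardCharsets.UTF_8);
        return !line.isEmpty() && Character.isDigit(line.charAt(0));
    }

    public static void main(String[] args) {
        // "42" preceded by the UTF-8 BOM (EF BB BF), and the same text without it.
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '4', '2'};
        byte[] withoutBom = {'4', '2'};

        String decoded = new String(withBom, StandardCharsets.UTF_8);
        System.out.println(decoded.length());                       // 3, not 2
        System.out.println(Integer.toHexString(decoded.charAt(0))); // feff
        System.out.println(firstCharIsDigit(withBom));              // false
        System.out.println(firstCharIsDigit(withoutBom));           // true
    }
}
```

    The decoded string has one extra, invisible character in front of the digit, which is why the debugger view looks right while charAt(0) is not a digit.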

    So you have two options:

    1. If you are in control of the file, regenerate it without the BOM.

    2. If not, check the file for a BOM and skip it before doing anything else with the content.

    Here is some code to get you started. It skips the BOM while reading rather than removing it from the file. Feel free to modify it as you like. It comes from a test utility I wrote some years ago:

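    // Needs java.io.BufferedInputStream, IOException, InputStream, PushbackInputStream
    private static InputStream filterBOMifExists(InputStream inputStream) throws IOException {
        PushbackInputStream pushbackInputStream = new PushbackInputStream(new BufferedInputStream(inputStream), 3);
        byte[] bom = new byte[3];
        // read() may deliver fewer than 3 bytes (e.g. a very short file)
        int n = pushbackInputStream.read(bom);
        if (n > 0) {
            boolean isBom = n == 3 && bom[0] == (byte) 0xEF && bom[1] == (byte) 0xBB && bom[2] == (byte) 0xBF;
            if (!isBom) {
                // Not a BOM: push back only the bytes that were actually read
                pushbackInputStream.unread(bom, 0, n);
            }
        }
        return pushbackInputStream;
    }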
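
    A BOM-skipping stream like this can be placed between the raw input and the Scanner from the question. The following is only a sketch under assumptions: the class BomSkipUsage and the methods skipUtf8Bom and digitLines are hypothetical names, the skipping logic is repeated so the example compiles on its own, and an in-memory byte array stands in for the real file.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class BomSkipUsage {
    // Returns a stream positioned after a leading UTF-8 BOM, if there is one.
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        // Loop because a single read() may deliver fewer bytes than requested.
        while (n < 3) {
            int r = pushback.read(head, n, 3 - n);
            if (r < 0) {
                break; // end of stream before 3 bytes were available
            }
            n += r;
        }
        boolean isBom = n == 3
                && head[0] == (byte) 0xEF && head[1] == (byte) 0xBB && head[2] == (byte) 0xBF;
        if (!isBom && n > 0) {
            pushback.unread(head, 0, n); // not a BOM: put the bytes back
        }
        return pushback;
    }

    // Collects the lines whose first character is a digit.
    static List<String> digitLines(InputStream raw) {
        List<String> result = new ArrayList<>();
        try (Scanner fileContent = new Scanner(skipUtf8Bom(raw), "UTF-8")) {
            while (fileContent.hasNextLine()) {
                String line = fileContent.nextLine();
                // Guard against empty lines before calling charAt(0).
                if (!line.isEmpty() && Character.isDigit(line.charAt(0))) {
                    result.add(line);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        // BOM followed by three lines: "1a", "b", "2c"
        byte[] data = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '1', 'a', '\n', 'b', '\n', '2', 'c', '\n'};
        System.out.println(digitLines(new ByteArrayInputStream(data))); // [1a, 2c]
    }
}
```

    With a real file you would pass new FileInputStream(theFile) instead of the byte array. Input without a BOM passes through unchanged, so the same code path works for both kinds of file.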