Tags: java, windows, utf-8, nio

Why is the first character of the first line of a file on Windows a 0?


So I'm reading a plain text file in Java, and I'd like to identify which lines start with "abc". I did the following:

Charset charset = StandardCharsets.UTF_8;
BufferedReader br = Files.newBufferedReader(file.toAbsolutePath(), charset);
String line;
while ((line = br.readLine()) != null) {
   if (line.startsWith("abc")) {
       // Do something
   }
}

But if the first line of the file is "abcd", it won't match. By debugging I've found out that the first character is a 0 (non-printable character), and because of this it won't match. Why is that so? How could I robustly identify which lines start with "abc"?

EDIT: perhaps I should point out that I'm creating the file using Notepad.


Solution

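  • Many Windows programs mark UTF-8 text files with a BOM (Byte Order Mark) at the very start of the file, so BOM-prefixed UTF-8 is common on that platform.

    If my guess is correct, the first three bytes of the file would then be (in hexadecimal): 0xEF, 0xBB, 0xBF. Java's UTF-8 decoder does not strip them; it decodes them to the single invisible character U+FEFF, which ends up in front of "abc" on the first line and makes startsWith("abc") fail.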

    Given that, for instance, Excel creates UTF-8 CSV files with a BOM prefix, I wouldn't be surprised at all if Notepad did as well...

    edit: not surprisingly, it seems this is the case: see here.
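
The answer above explains the cause but does not show a fix, so here is a minimal sketch of one way to handle it, assuming the file really is UTF-8 and that the only stray character is a single leading U+FEFF on the first line. The class name BomAwareLines and its two methods are hypothetical, made up for illustration; only the java.nio and java.io calls come from the JDK.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original question or answer.
class BomAwareLines {
    // The UTF-8 bytes EF BB BF decode to this single character (U+FEFF).
    private static final char BOM = '\uFEFF';

    // Returns the line without a leading U+FEFF, or the line unchanged if there is none.
    static String stripBom(String line) {
        if (!line.isEmpty() && line.charAt(0) == BOM) {
            return line.substring(1);
        }
        return line;
    }

    // Collects every line of the file that starts with the given prefix,
    // ignoring a BOM at the very start of the file.
    static List<String> linesStartingWith(Path file, String prefix) throws IOException {
        List<String> matches = new ArrayList<>();
        try (BufferedReader br = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line = br.readLine();
            if (line != null) {
                // Only the first line of the file can carry the BOM.
                line = stripBom(line);
            }
            while (line != null) {
                if (line.startsWith(prefix)) {
                    matches.add(line);
                }
                line = br.readLine();
            }
        }
        return matches;
    }
}
```

With a Notepad-created file whose first line is "abcd", BomAwareLines.linesStartingWith(file, "abc") should include that line. Stripping only on the first line keeps the check cheap and avoids touching a U+FEFF that legitimately appears elsewhere in the text.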