Reading a file with BufferedReader gives garbage values in the String's byte array (zeros and negative bytes)


I am trying to read the contents of an .srt file line by line using Java's BufferedReader. The file contains:

2
00:00:40,665 --> 00:00:44,806
<i>♪ Nants ingonyama ♪</i>

And the code snippet is as follows:

public void parseSubtitles(@NonNull final MultipartFile subtitleFile) throws IOException {
    InputStream is = subtitleFile.getInputStream();
    BufferedReader reader = new BufferedReader(new InputStreamReader(is));
    String line;
    while ((line = reader.readLine()) != null) {
        System.out.println(line);
    }
}

While stepping through the code with breakpoints, I found that for the first line, 2, the String's byte array is [-3, -1, -3, -1, 50, 0, 0, 0].

The next line is a byte array containing only [0].

The line after that is [0, 48, 0, 48, 0, 58, 0, 48, 0, 48, 0, 58, 0, 52, 0, 48, 0, 44, 0, 54, 0, 54, 0, 53, 0, 32, 0, 45, 0, 45, 0, 62, 0, 32, 0, 48, 0, 48, 0, 58, 0, 48, 0, 48, 0, 58, 0, 52, 0, 52, 0, 44, 0, 56, 0, 48, 0, 54, 0], which is the subtitle's time interval.

This does not happen with other subtitle files: their byte arrays contain no 0 values, and there are no garbage lines such as a byte array holding only [0].

Any idea on what might be causing this issue?


Solution

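  • My guess is that the file is encoded as UTF-16 and should be read as such. The tell-tale sign is the null byte paired with each non-null one: UTF-16 uses two bytes per character here, so for ASCII characters one byte of each pair is always 0. That redundancy is why UTF-8 is used more widely, except for certain languages whose characters take more bytes in UTF-8. The leading [-3, -1, -3, -1] is likely a byte order mark that was decoded into two U+FFFD replacement characters. An InputStreamReader created without a charset uses the platform default (often UTF-8), so passing a UTF-16 charset to its constructor should fix the problem.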
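
A minimal sketch of that suggestion, assuming the file really is UTF-16 with a byte order mark (the question does not confirm this). The class name Utf16SubtitleDemo and the helpers readLines and sampleUtf16File are hypothetical, and the sample bytes stand in for the uploaded file's stream. It decodes the same bytes once as UTF-8 and once as UTF-16 to show the difference.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class Utf16SubtitleDemo {

    // Hypothetical helper: reads every line from the stream using an explicit charset
    // instead of relying on the platform default.
    public static List<String> readLines(InputStream in, Charset charset) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    // Assumption: the problem file is little-endian UTF-16 with a byte order mark (FF FE).
    // This builds a small in-memory stand-in for such a file.
    public static byte[] sampleUtf16File() {
        byte[] text = "2\r\n00:00:40,665 --> 00:00:44,806\r\n".getBytes(StandardCharsets.UTF_16LE);
        byte[] file = new byte[text.length + 2];
        file[0] = (byte) 0xFF;
        file[1] = (byte) 0xFE;
        System.arraycopy(text, 0, file, 2, text.length);
        return file;
    }

    public static void main(String[] args) throws IOException {
        byte[] file = sampleUtf16File();

        // Decoding as UTF-8 turns the byte order mark into replacement characters
        // and leaves a NUL character next to every ASCII character.
        List<String> wrong = readLines(new ByteArrayInputStream(file), StandardCharsets.UTF_8);
        System.out.println("UTF-8 first line length: " + wrong.get(0).length());

        // StandardCharsets.UTF_16 uses the byte order mark to pick the byte order.
        List<String> right = readLines(new ByteArrayInputStream(file), StandardCharsets.UTF_16);
        System.out.println("UTF-16 lines: " + right);
    }
}
```

In the question's method this would amount to new InputStreamReader(is, StandardCharsets.UTF_16). If the application also has to accept UTF-8 files, it would need to inspect the first bytes of the stream for a byte order mark before choosing the charset.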