Counting the correct number of line breaks in a text file


I'm working through the first exercise in John Crickett's coding challenges, which is to create a wc clone that counts the number of lines, bytes, words and characters in a text file. I'm on Step 4, counting characters.

My code so far is as follows:

public static long countChars(File inputFile) {
    long count = 0;

    try (BufferedReader reader = new BufferedReader(new FileReader(inputFile))) {
        String line;
        while ((line = reader.readLine()) != null) {
            line = line.replaceAll("\uFEFF", ""); // Remove BOM
            count += line.length();
        }
    } catch (IOException e) {
        e.printStackTrace();
    }

    return count;
}

The issue is that wc returns an output of 339292, whereas my code is returning 325001. My initial suspicion was that my code was simply ignoring line break characters, and I noticed something interesting. The difference between the output of wc and my own code's output is 14291 missing characters, which is double the number of lines plus one.

I am trying to understand the following:

  1. Why are there two line breaks per line rather than one? Surely every line has a single line break at the end of it?
  2. What is the extra single character?
  3. Is it naive to simply double the line count and add it onto the character count (and then +1)? Will this trip me up in some edge case somehow?

Some of the behaviour is a bit strange here. When I get the code to print out the character count for theArtOfWar.txt line by line, without the line of code that removes the BOM, it gives 46 for the character count of the first line. When I change it to remove the BOM, it gives 45, which is what wc gives too.

However, the same behaviour is not reproducible when I remove every line except the first line in a file called theArtOfWarFirstLine.txt. In that case, my code and wc both give 45, regardless of whether I include the line removing the BOM. It's like the act of removing every line except the first one removed the BOM as well.

Hex dump:

The Art of War: (note: just the first ten lines of the hex dump, the whole thing was far too long)

00000000: efbb bf54 6865 2050 726f 6a65 6374 2047  ...The Project G
00000010: 7574 656e 6265 7267 2065 426f 6f6b 206f  utenberg eBook o
00000020: 6620 5468 6520 4172 7420 6f66 2057 6172  f The Art of War
00000030: 0d0a 2020 2020 0d0a 5468 6973 2065 626f  ..    ..This ebo
00000040: 6f6b 2069 7320 666f 7220 7468 6520 7573  ok is for the us
00000050: 6520 6f66 2061 6e79 6f6e 6520 616e 7977  e of anyone anyw
00000060: 6865 7265 2069 6e20 7468 6520 556e 6974  here in the Unit
00000070: 6564 2053 7461 7465 7320 616e 640d 0a6d  ed States and..m
00000080: 6f73 7420 6f74 6865 7220 7061 7274 7320  ost other parts 
00000090: 6f66 2074 6865 2077 6f72 6c64 2061 7420  of the world at 
000000a0: 6e6f 2063 6f73 7420 616e 6420 7769 7468  no cost and with

The Art of War (first line):

00000000: 5468 6520 5072 6f6a 6563 7420 4775 7465  The Project Gute
00000010: 6e62 6572 6720 6542 6f6f 6b20 6f66 2054  nberg eBook of T
00000020: 6865 2041 7274 206f 6620 5761 72         he Art of War

I can for sure see that there is something in the original that is missing from the start of the 'first line' version. I suspect it's the BOM. It's hard to test if wc ignores the BOM or not without manually inserting it into one of these one liner text files, which I'm not sure how to do.
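
For reference, one way to build such a test file is to write the three bytes ef bb bf (the UTF-8 BOM seen at the start of the full hex dump) ahead of the text. A minimal sketch, assuming the text itself is UTF-8; the class name BomFileWriter and the file name firstLineWithBom.txt are made up for illustration:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class BomFileWriter {
    // UTF-8 encoding of U+FEFF: the same three bytes (ef bb bf) that open the full hex dump.
    static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    // Writes the BOM followed by the UTF-8 bytes of text, replacing any existing file.
    static void writeWithBom(Path target, String text) throws IOException {
        byte[] body = text.getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[UTF8_BOM.length + body.length];
        System.arraycopy(UTF8_BOM, 0, out, 0, UTF8_BOM.length);
        System.arraycopy(body, 0, out, UTF8_BOM.length, body.length);
        Files.write(target, out);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical file name, chosen for illustration only.
        writeWithBom(Paths.get("firstLineWithBom.txt"),
                "The Project Gutenberg eBook of The Art of War");
    }
}
```

The resulting file can then be run through both wc and countChars to compare how each treats the BOM.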


Solution

  • It seems wc is counting \r and \n as characters and your code is not, because readLine() returns each line without its line terminator.

    Add to that the BOM, which you strip before counting, and you get the difference you are seeing: two characters (\r\n) for every line, plus one for the BOM.

    You should probably not be reading by line if you want to count the \r and the \n. Instead, I would read character by character. If you also want to count lines in the same loop, keep a piece of state that records whether the previous character was a \r, so that the \n of a \r\n pair is not counted as a second line break. That said, it seems you are making several passes: one for the lines, another for the characters, and I guess another for the words.

    So you just need this:

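    // Needs java.io.File, java.io.FileReader, java.io.IOException and java.io.Reader.
    public static long countChars(File inputFile) {
        long count = 0;
    
        try (Reader reader = new FileReader(inputFile)) {
            // read() returns -1 at end of stream; 0 is a valid character (NUL),
            // so compare against -1 rather than testing for > 0.
            while (reader.read() != -1) count++;
        } catch (IOException e) {
            e.printStackTrace();
        }
    
        return count;
    }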
    

    But you could also do everything in one pass, using the state I mentioned above: remember that the last character read was a \r, and do not count a \n that immediately follows it as another line break.
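
The single-pass idea above could be sketched roughly as follows. This is only a sketch under assumptions: the class name OnePassCounter and its Counts holder are made up for illustration, a line break is taken to be \r\n, a lone \r, or a lone \n, and characters are counted as Java chars (UTF-16 code units), so a character outside the Basic Multilingual Plane counts as two, which may differ from what wc reports.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

final class OnePassCounter {
    /** Totals gathered in a single pass over the input. */
    static final class Counts {
        long chars;      // every char read, including \r, \n and a leading BOM
        long lineBreaks; // \r\n, a lone \r and a lone \n each count once
    }

    static Counts count(Reader in) throws IOException {
        Counts c = new Counts();
        boolean lastWasCr = false; // the state: was the previous char a \r?
        int ch;
        while ((ch = in.read()) != -1) {
            c.chars++;
            if (ch == '\r') {
                c.lineBreaks++;
                lastWasCr = true;
            } else {
                // A \n right after a \r belongs to the break already counted.
                if (ch == '\n' && !lastWasCr) {
                    c.lineBreaks++;
                }
                lastWasCr = false;
            }
        }
        return c;
    }

    public static void main(String[] args) throws IOException {
        // BOM + "ab" + CRLF + "cd" + CRLF
        Counts c = count(new StringReader("\uFEFFab\r\ncd\r\n"));
        System.out.println(c.chars + " chars, " + c.lineBreaks + " line breaks");
        // prints: 9 chars, 2 line breaks
    }
}
```

To run it on a file, it is probably safer to pass a reader with an explicit charset, for example Files.newBufferedReader(path, StandardCharsets.UTF_8), than to rely on the default charset that FileReader uses.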