Counting characters, a Java program and wc yield inconsistent results


I wrote a Java program that counts the number of characters in a file. To check that the program works correctly, I run this on the Linux command line and compare the counts:

wc -m fileName

From the man page for wc, I know that newline characters are included in the count.

Here is my java program:

import java.io.IOException;
import java.io.File;
import java.util.Scanner;

public class NumOfChars {
  /** The main method. */
  public static void main(String[] args) throws IOException {
    // Check that command is entered correctly
    if (args.length != 1) {
      System.out.println("Usage: java NumOfChars fileName");
      return; // stop instead of continuing with bad arguments
    }

    // Check that source file exists
    File file = new File(args[0]);
    if (!file.exists()) {
      System.out.printf("File %s does not exist\n", file);
      return; // stop instead of letting Scanner throw FileNotFoundException
    }

    // Create Scanner object
    Scanner input = new Scanner(file);

    int characters = 0;
    while (input.hasNext()) {
      
      String line = input.nextLine();

      // The number of characters is the length of the line plus the newline character
      characters += line.length() + 1;
    }
    input.close();

    // Print results
    System.out.printf("File %s has\n", args[0]);
    System.out.printf("%d characters\n", characters);
  }
}

The problem is that the count reported by the Java program sometimes differs from the count reported by wc.

Here are two examples:

One that works. The contents of the file text.txt are:

This is some text
This is some text
This is some text
This is some text
This is some text
This is some text
This is some text
This is some text

The command wc -m text.txt tells me that this file has 144 characters. This matches the Java program: when I run java NumOfChars text.txt, I am also told that the file has 144 characters.

One that doesn't work. The contents of the file Exercise06.java are:

import java.util.Scanner;
import java.util.regex.Pattern;
import java.util.regex.Matcher;

/** Converts a hexadecimal to a decimal. */
public class Exercise06 {
  /** Main method */
  public static void main(String[] args) {
    // Create a Scanner
    Scanner input = new Scanner(System.in);

    // Prompt the user to enter a string
    System.out.print("Enter a hex number: ");
    String hex = input.nextLine();
    
    // Display result
    System.out.println("The decimal value for hex number "
      + hex + " is " + hexToDecimal(hex.toUpperCase()));
  }
  

  /** Converts hexadecimal to decimal.
      @param hex The hexadecimal
      @return The deciaml value of hex
      @throws NumberFormatException if hex is not a hexadecimal
    */
  public static int hexToDecimal(String hex) throws NumberFormatException {
    // Check if hex is a hexadecimal. Throw Exception if not.
    boolean patternMatch = Pattern.matches("[0-9A-F]+", hex);
    if (!patternMatch) 
      throw new NumberFormatException();

    // Convert hex to a decimal
    int decimalValue = 0;
    for (int i = 0; i < hex.length(); i++) {
      char hexChar = hex.charAt(i);
      decimalValue = decimalValue * 16 + hexCharToDecimal(hexChar);
    }
    // Return the decimal
    return decimalValue;
  }
  
  
  /** Converts a hexadecimal Char to a deciaml.
      @param ch The hexadecimal Char
      @return The decimal value of ch
    */
  public static int hexCharToDecimal(char ch) {
    if (ch >= 'A' && ch <= 'F')
      return 10 + ch - 'A';
    else // ch is '0', '1', ..., or '9'
      return ch - '0';
  }
}

The command wc -m Exercise06.java tells me that this file has 1650 characters. However, when I run java NumOfChars Exercise06.java, I am told that the file has 1596 characters.

I can't seem to figure out what I'm doing wrong. Can anyone provide me with some feedback?

EDIT: Here is what I get when typing in head -5 Exercise06.java | od -c (the output was posted as a screenshot, which is not preserved here).


Solution

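  • There are several possible explanations:

    • Each line may end with more than one character. On Windows, for example, each line ends with CR + LF, whereas your program always adds exactly 1 for the line ending. Your numbers fit this explanation: 1650 - 1596 = 54, and the Exercise06.java listing as shown has 54 lines, which is one uncounted character per line. If this is the cause, od -c shows \r \n at the end of each line.

    • wc may assume a different character encoding than your program. wc -m decodes the file according to the current locale, whereas new Scanner(file) uses Java's default charset, so a file containing multi-byte characters can produce different counts.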
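Both explanations can be checked with a counter that reads the file one character at a time instead of line by line. The following is only a minimal sketch, not a fix of the original program: the class name CharCount and the method count are made up for illustration, and it assumes the file is UTF-8 and that wc runs in a UTF-8 locale. It counts every character, including CR and LF, and reports separately how many carriage returns it saw.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper: counts characters without dropping line terminators. */
public class CharCount {

  /**
   * Returns {characters, carriageReturns} for the given file.
   * Characters are counted as Unicode code points, so a character stored
   * as a surrogate pair in Java is counted once.
   */
  public static long[] count(Path path, Charset charset) throws IOException {
    long characters = 0;
    long carriageReturns = 0;
    try (BufferedReader reader = Files.newBufferedReader(path, charset)) {
      int c;
      while ((c = reader.read()) != -1) {
        // Skip the second half of a surrogate pair; the first half was counted.
        if (Character.isLowSurrogate((char) c)) {
          continue;
        }
        characters++;
        if (c == '\r') {
          carriageReturns++;
        }
      }
    }
    return new long[] {characters, carriageReturns};
  }

  public static void main(String[] args) throws IOException {
    if (args.length != 1) {
      System.err.println("Usage: java CharCount fileName");
      return;
    }
    // Assumption: the file is UTF-8. Pass another Charset if it is not.
    long[] result = count(Paths.get(args[0]), StandardCharsets.UTF_8);
    System.out.printf("File %s has%n", args[0]);
    System.out.printf("%d characters (%d of them carriage returns)%n",
        result[0], result[1]);
  }
}
```

If the carriage-return count is close to the number of lines, the file has CR + LF line endings, and those are the characters the line-based program misses. Passing the charset explicitly also removes the dependence on Java's default charset.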