
FileWriter somehow writes in Chinese


Please help me with this problem. I'm trying to write code that reads a .txt file and counts the frequency of each letter in the file. This is what I came up with:

public static void charCount(String file) throws IOException {
    FileReader fr = new FileReader(file);
    BufferedReader br = new BufferedReader(fr);

    int[] count = new int[26];
    String line;
    while ((line = br.readLine()) != null) {
        line = line.toUpperCase();
        char[] characters = line.toCharArray();
        for (int i = 0; i < line.length(); i++) {
            if ((characters[i] >= 'A') && (characters[i] <= 'Z')) {
                count[characters[i] - 'A']++;
            }
        }
    }
    File file2 = new File("D:/Project/Aufgabe/Winter_2019/frequency.txt");
    file2.createNewFile();
    FileWriter fw = new FileWriter(file2);
    for (int i = 0; i < 26; i++) {
        fw.write(((char)(i + 'A')) + ": " + count[i]);
    }
    fw.close();
    br.close();
}

When I print the result to the console with System.out.println(), I get this output:

A: 15
B: 4
C: 9
D: 10
E: 2
F: 1
G: 0
H: 3
I: 5
J: 6
K: 3
L: 0
M: 2
N: 7
O: 3
P: 1
Q: 1
R: 0
S: 4
T: 0
U: 2
V: 0
W: 5
X: 0
Y: 1
Z: 0

That is what I want. But when I write it to a file, the .txt file contains this:

㩁ㄠ䈵›䌴›䐹›〱㩅㈠㩆ㄠ㩇〠㩈㌠㩉㔠㩊㘠㩋㌠㩌〠㩍㈠㩎㜠㩏㌠㩐ㄠ㩑ㄠ㩒〠㩓㐠㩔〠㩕㈠㩖〠㩗㔠㩘〠㩙ㄠ㩚〠

I'm still new to Java, so any help would be much appreciated.


Solution

  • While a couple of things about your program could be improved, none of them is the reason you see Chinese characters. In fact, your program works just fine, and the resulting file contains the same text you saw with System.out.println, only without line breaks.

    I copied your output example, pasted it into a new file using Notepad and, after saving, looked at the file with a hex editor (HxD in this case). The hex data started like this: FF FE 41 3A 20 31 35 42..., which "translates" to ÿþA: 15B.... That's exactly your expected result plus a BOM (byte order mark), which Notepad added while saving the file and which is therefore not part of the original data.

    So why do you see the strange result? The cause is not your program but the text viewer you're using. When a file has no BOM, many viewers make an educated guess about its encoding; Windows Notepad, for example, decides whether a file should be read as cp1252 (Windows Latin-1), UTF-8 or Unicode/UTF-16. The detection algorithms differ, so it's hard to say why your viewer decided that this file might be UTF-16, but that's the way it is ;-)
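
    To illustrate the misdetection, here is a minimal sketch that decodes the plain ASCII bytes of the expected output as if they were UTF-16 (little endian). The class and method names are made up for this example; only the JDK's own String and StandardCharsets APIs are used.

```java
import java.nio.charset.StandardCharsets;

public class MisreadDemo {
    // Encodes 'text' as ASCII bytes, then decodes those bytes as UTF-16LE,
    // which is roughly what a viewer does when it guesses the wrong encoding.
    static String misreadAsUtf16(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.US_ASCII);
        return new String(bytes, StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        // "A: 15B" is the start of the file content (no line breaks).
        String garbled = misreadAsUtf16("A: 15B");
        for (char c : garbled.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```

    Each pair of ASCII bytes is merged into one 16-bit character: 41 3A becomes U+3A41, 20 31 becomes U+3120 and 35 42 becomes U+4235, which are the first three characters of the garbled text in the question.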

    This is only a guess, but adding line breaks might fix your problem. Change

    fw.write(((char)(i + 'A')) + ": " + count[i]);
    

    to

    fw.write(((char)(i + 'A')) + ": " + count[i] + "\r\n");
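
    As a rough sketch of how the writing part could look with the line breaks added, the version below also uses try-with-resources so the writer is closed even if an exception occurs. The class and method names (FrequencyWriter, format, write) are hypothetical and not part of the original program.

```java
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;

public class FrequencyWriter {
    // Builds one "X: n" entry per line, terminated by a Windows line break.
    // Assumes count has 26 entries, one per letter A..Z.
    static String format(int[] count) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count.length; i++) {
            sb.append((char) ('A' + i)).append(": ").append(count[i]).append("\r\n");
        }
        return sb.toString();
    }

    // Writes the formatted counters using the platform default charset,
    // just like the FileWriter in the question does.
    static void write(String path, int[] count) throws IOException {
        try (BufferedWriter out = new BufferedWriter(new FileWriter(path))) {
            out.write(format(count));
        }
    }
}
```

    Calling FrequencyWriter.write("frequency.txt", count) would then produce one letter per line, matching the console output.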
    

    Alternatively, write the file using a charset whose encoder emits a BOM, e.g. UTF-16 (Java's UTF-8 encoder does not write a BOM on its own, so for UTF-8 you would have to write the '\uFEFF' character yourself as the first character). With Java 11 you can set the charset on FileWriter directly (there is a new constructor that accepts it); if you have to use an older version of Java, you need to use OutputStreamWriter:

    OutputStreamWriter fw = new OutputStreamWriter(new FileOutputStream(file2), "UTF-16");
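
    The Java 11 constructor mentioned above could be used roughly as follows. This sketch picks UTF-16 because Java's UTF-16 encoder writes a BOM at the start of the output, whereas its UTF-8 encoder does not. The class name, method name and temp file are assumptions made for the example.

```java
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

public class Utf16WriteDemo {
    // Writes 'text' to 'target' as UTF-16.
    // FileWriter(File, Charset) requires Java 11 or newer.
    static void writeUtf16(File target, String text) throws IOException {
        try (Writer fw = new FileWriter(target, StandardCharsets.UTF_16)) {
            fw.write(text);
        }
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("frequency", ".txt");
        writeUtf16(tmp, "A: 15\r\n");
        // Print the first two bytes of the file: the big-endian BOM.
        byte[] bytes = Files.readAllBytes(tmp.toPath());
        System.out.printf("%02X %02X%n", bytes[0] & 0xFF, bytes[1] & 0xFF);
        tmp.delete();
    }
}
```

    The file then starts with the bytes FE FF, so a viewer no longer has to guess the encoding.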
    

    Also, check whether your text viewer's "Open File" dialog lets you specify the charset explicitly. Notepad on a German Windows system calls the option "Codierung" (encoding), and "ANSI" there means cp1252 (the charset your Java Virtual Machine most likely used when you created the FileWriter without a specific charset).