I'm opening a file with
    private String getStringFromFile(File file) {
        try {
            return Files.readString(Paths.get(file.getPath()), StandardCharsets.US_ASCII);
        }
        catch (Exception e) {
            System.out.println("Error while reading: " + file.getName());
            return "";
        }
    }
and even though the file seems to be clearly ASCII compatible, I'm getting "Error while reading: fileName".
The file looks like this:
The code works if I manually delete the header (the part with square brackets) before opening the file (I'm deleting it later in the code anyway). Is there a way to extend the range of accepted charsets without breaking my existing code, which only works on ASCII, or is this some kind of rare exception?
Here's the file in pgn (it can be opened as txt).
The file is almost in ASCII. The problem is with the quote character in 'Cote d’Ivoire'.
The file contains a 0x92 byte. In Windows code page 1252 (West European Languages) it is the Unicode character U+2019 RIGHT SINGLE QUOTATION MARK.
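If you want to check your own files for such bytes, a minimal sketch could look like this. The class and method names are made up for illustration; the only assumption is that you have the raw bytes, for example from Files.readAllBytes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the question's code.
final class NonAsciiFinder {

    /** Returns the offsets of all bytes outside the 7-bit ASCII range (0x00-0x7F). */
    static List<Integer> nonAsciiOffsets(byte[] data) {
        List<Integer> offsets = new ArrayList<>();
        for (int i = 0; i < data.length; i++) {
            // Java bytes are signed, so mask to get the unsigned value.
            if ((data[i] & 0xFF) > 0x7F) {
                offsets.add(i);
            }
        }
        return offsets;
    }
}
```

Calling NonAsciiFinder.nonAsciiOffsets(Files.readAllBytes(file.toPath())) on the file from the question should report the position of the 0x92 byte.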
The catch is that code page 1252 is a slight variation of ISO-8859-1: it reuses the 0x80-0x9F range, which ISO-8859-1 reserves for control codes, for some common characters like the euro symbol € and the left and right quotation marks. It is also not among the charsets that every Java platform is required to support.
How to fix:
If your Java runtime supports the windows-1252 (alias cp1252) charset, use it.
Otherwise, wrap the input in a custom FilterInputStream to replace the non-ASCII characters, for example with a space (ASCII 0x20) or from a custom Map (0x92 -> 0x27 to replace the RIGHT SINGLE QUOTATION MARK (’) with a simple APOSTROPHE (')). After that, the InputStreamReader will give you the expected characters.
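Both options can be sketched roughly as follows. AsciiFoldingInputStream and LenientReader are hypothetical names, and the replacement rule (0x92 becomes an apostrophe, any other non-ASCII byte becomes a space) is just one possible choice, so adapt it to your data.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical filter: 0x92 becomes an ASCII apostrophe, other non-ASCII bytes become a space. */
final class AsciiFoldingInputStream extends FilterInputStream {

    AsciiFoldingInputStream(InputStream in) {
        super(in);
    }

    private static int fold(int b) {
        if (b < 0x80) {
            return b; // plain ASCII, or -1 for end of stream
        }
        return b == 0x92 ? '\'' : ' ';
    }

    @Override
    public int read() throws IOException {
        return fold(super.read());
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        // When n is -1 (end of stream) the loop body never runs.
        for (int i = off; i < off + n; i++) {
            buf[i] = (byte) fold(buf[i] & 0xFF);
        }
        return n;
    }
}

/** Hypothetical helper combining the two options. */
final class LenientReader {

    static String readLenient(Path path) throws IOException {
        // Option 1: decode as windows-1252 when this runtime provides it.
        if (Charset.isSupported("windows-1252")) {
            return new String(Files.readAllBytes(path), Charset.forName("windows-1252"));
        }
        // Option 2: fold non-ASCII bytes, then decode as US-ASCII.
        try (InputStream in = new AsciiFoldingInputStream(Files.newInputStream(path))) {
            return new String(in.readAllBytes(), StandardCharsets.US_ASCII);
        }
    }
}
```

Note that the two branches do not return identical text: the first keeps ’ (U+2019) in the string, while the second turns it into a plain '. If the rest of your code really only handles ASCII, using the filter unconditionally may be the simpler choice.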