I run into a problem when I try to open an ARFF file with Weka.
When the file's encoding is set to ANSI, everything seems to work. But when I set the encoding to UTF-8 (which my data requires), I get the following error:
Unable to determine structure as arff(Reason java.io.Exception: keyword @relation expected,read token[@relation], line 1).
My ARFF file seems to be properly formatted:
@relation myrelation
@attribute pagename string
@attribute pagetext string
@attribute pagecategory string
@attribute pageclass {0,1,2,3,4,5,6,7,8,9,10}
@data
.......
Note: I also changed the file encoding to UTF-8 in the RunWeka.ini file.
As the error mentions line 1, I suspect the UTF-8 file was written with a BOM (byte order mark) at the start. This unneeded zero-width space is what Notepad under Windows uses to distinguish an ANSI text file from a UTF-8 text file.
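Create the file without the BOM (U+FEFF). This can be done with a programmer's editor (jEdit, Notepad++), with a hex editor, or by deleting the first line and retyping it. Then check the file size: the UTF-8 BOM is the three bytes EF BB BF, so the file should be three bytes smaller once it is gone.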
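Many parsers do not expect such a BOM and do not treat it as whitespace, so they fail on it. Here the BOM presumably ends up glued to the first token, so the parser reads U+FEFF followed by @relation instead of plain @relation. Because the character is invisible, the error message appears to show exactly the keyword it claims to be missing.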
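To detect and remove the BOM in Java:
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

Path path = Paths.get("...");
String s = new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
String t = s.replaceFirst("^\uFEFF", ""); // drop a leading BOM, if any
if (!s.equals(t)) {
    System.out.println("BOM character present in UTF-8 text");
    Files.write(path, t.getBytes(StandardCharsets.UTF_8)); // Replaces file!
}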
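If you only want to confirm the suspicion before touching the file, you can look at the first three bytes and compare them with the UTF-8 BOM (EF BB BF). The following is a minimal sketch, not a definitive tool; the class name `BomCheck`, its method names, and the default file name `data.arff` are made up for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: reports whether a file starts with the UTF-8 BOM.
public class BomCheck {

    // True if the given bytes begin with EF BB BF.
    static boolean startsWithUtf8Bom(byte[] head) {
        return head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
    }

    // Reads at most the first three bytes of the file; the file is not modified.
    static boolean fileHasUtf8Bom(Path path) throws IOException {
        byte[] head = new byte[3];
        int n = 0;
        try (InputStream in = Files.newInputStream(path)) {
            while (n < head.length) {
                int r = in.read(head, n, head.length - n);
                if (r < 0) {
                    break; // file is shorter than three bytes
                }
                n += r;
            }
        }
        return n == head.length && startsWithUtf8Bom(head);
    }

    public static void main(String[] args) throws IOException {
        // "data.arff" is a placeholder; pass the real path as the first argument.
        Path path = Paths.get(args.length > 0 ? args[0] : "data.arff");
        System.out.println(fileHasUtf8Bom(path) ? "UTF-8 BOM found" : "No UTF-8 BOM");
    }
}
```

Run it as `java BomCheck yourfile.arff`. Unlike the snippet above, this only reads the file, so it is safe to try on the original data.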