machine-learning, nlp, weka

Unable to determine structure as arff when using utf-8 arff file in Weka


I run into a problem when I try to open an ARFF file with Weka.

When the ARFF file is encoded as ANSI, everything works. But when I set the encoding to UTF-8 (which my data requires), I get the following error:

Unable to determine structure as arff(Reason java.io.Exception: keyword @relation expected,read token[@relation], line 1).

My ARFF file seems to be properly formatted:

@relation myrelation

@attribute pagename string
@attribute pagetext string
@attribute pagecategory string
@attribute pageclass {0,1,2,3,4,5,6,7,8,9,10}

@data
.......

Note: I also changed the file encoding to UTF-8 in the RunWeka.ini file.


Solution

  • As the error mentions line 1, I suspect the UTF-8 file was written with a BOM (byte order mark) at the start. Notepad on Windows writes this unneeded zero-width character to distinguish a UTF-8 text file from an ANSI one.

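    Create the file without the BOM (U+FEFF). This can be done with a programmer's editor (jEdit, Notepad++) or a hex editor, or you could delete the first line and retype it. Then check the file size: with otherwise identical content it should be three bytes smaller, since the UTF-8 BOM is the byte sequence EF BB BF.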

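    Many parsers do not expect a BOM and do not treat it as whitespace, so they fail on the first token. Because the character is invisible, the rejected token in the error message looks identical to the expected keyword.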

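    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    // Run inside a method that declares "throws IOException".
    Path path = Paths.get("...");
    String s = new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
    // Strip a single leading BOM (U+FEFF), if present.
    String t = s.replaceFirst("^\uFEFF", "");
    if (!s.equals(t)) {
        System.out.println("BOM character present in UTF-8 text");
        Files.write(path, t.getBytes(StandardCharsets.UTF_8)); // Replaces file!
    }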
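
If you would rather confirm the BOM at the byte level before touching the file, the check can be sketched roughly as follows. This is only an illustration: the class name `BomCheck` and the helpers `hasUtf8Bom` and `stripUtf8Bom` are made up for this example, and it works on an in-memory byte array rather than on your actual ARFF file.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, not part of Weka or the JDK.
class BomCheck {

    // U+FEFF encoded as UTF-8 is the three bytes EF BB BF.
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    // Returns true if the data starts with the UTF-8 BOM bytes.
    static boolean hasUtf8Bom(byte[] data) {
        return data.length >= UTF8_BOM.length
                && data[0] == UTF8_BOM[0]
                && data[1] == UTF8_BOM[1]
                && data[2] == UTF8_BOM[2];
    }

    // Returns a copy without the leading BOM, or the input unchanged if none is present.
    static byte[] stripUtf8Bom(byte[] data) {
        return hasUtf8Bom(data)
                ? Arrays.copyOfRange(data, UTF8_BOM.length, data.length)
                : data;
    }

    public static void main(String[] args) {
        // Simulated file content: a BOM followed by the first ARFF line.
        byte[] withBom = "\uFEFF@relation myrelation\n".getBytes(StandardCharsets.UTF_8);
        byte[] cleaned = stripUtf8Bom(withBom);

        System.out.println("BOM before: " + hasUtf8Bom(withBom));   // true
        System.out.println("BOM after:  " + hasUtf8Bom(cleaned));   // false
        System.out.println("Bytes removed: " + (withBom.length - cleaned.length)); // 3
    }
}
```

To apply this to a real file, you could read it with `Files.readAllBytes`, pass the result through `stripUtf8Bom`, and write the bytes to a new path so the original stays intact.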