Tags: java, csv, utf-8, character-encoding, opencsv

OpenCSV CsvToBean: First column not read for UTF-8 Without BOM


When I use OpenCSV to parse a UTF-8 document without a BOM, the first column is not read. The same content encoded as UTF-8 with a BOM is parsed correctly.

I explicitly set the charset to UTF-8:

    fileInputStream = new FileInputStream(file);
    inputStreamReader = new InputStreamReader(fileInputStream, StandardCharsets.UTF_8);
    reader = new BufferedReader(inputStreamReader);
    HeaderColumnNameMappingStrategy<Bean> ms = new HeaderColumnNameMappingStrategy<Bean>();
    ms.setType(Bean.class);
    CsvToBean<Bean> csvToBean = new CsvToBeanBuilder<Bean>(reader).withType(Bean.class).withMappingStrategy(ms)
            .withSeparator(';').build();
    csvToBean.parse();

I've created a sample project where the issue can be reproduced: https://github.com/dajoropo/csv2beanSample

If you run the unit test, you can see that the UTF-8 file without a BOM fails while the one with a BOM works correctly.

The error occurs in the second assertion, because the first column is not read. The result is:

[Bean [a=null, b=second, c=third]]

Any hint?


Solution

  • If I open the Bean class in your project and search for "B", I find one match. If I search for "A", I find none :) That means the "A" in the Bean class was copy/pasted together with a BOM prefix. The BOM is not visible, but it is still taken into account when the column name is matched against the header.

    If I fix "A", the other test starts failing, but I think you can fix that using BOMInputStream.

    See this question and its answer: Byte order mark screws up file reading in Java

    It is a known problem. You can use Apache Commons IO's BOMInputStream to solve it.

    I just tried it. Adding the dependency

        <dependency>
            <groupId>commons-io</groupId>
            <artifactId>commons-io</artifactId>
            <version>2.6</version>
        </dependency>

    wrapping the input stream in a BOMInputStream

        inputStreamReader = new InputStreamReader(new BOMInputStream(fileInputStream), StandardCharsets.UTF_8);

    and fixing

        @CsvBindByName(column = "A")
        private String a;

    so that "A" no longer contains the BOM prefix makes both tests pass.
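
To make the invisible prefix visible, here is a minimal stand-alone sketch that uses only the JDK (no OpenCSV, no Commons IO). The class name `BomDemo`, its helper methods and the sample header line `A;B;C` are made up for illustration and are not taken from the sample project. It relies on the fact that Java's UTF-8 decoder does not drop a leading BOM, so the first header name is read as U+FEFF followed by `A`.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;

// Hypothetical demo class, not part of the sample project.
class BomDemo {

    // The UTF-8 BOM bytes EF BB BF decode to this single character.
    static final char BOM = '\uFEFF';

    // Reads the first line of UTF-8 bytes the same way the question's code does.
    static String firstLine(byte[] utf8Bytes) {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(utf8Bytes), StandardCharsets.UTF_8))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Lists the UTF-16 code units of a string so hidden characters show up.
    static String describe(String s) {
        return s.chars()
                .mapToObj(c -> String.format("U+%04X", c))
                .collect(Collectors.joining(" "));
    }

    // Removes one leading BOM character, if there is one.
    static String stripBom(String s) {
        return (s != null && !s.isEmpty() && s.charAt(0) == BOM) ? s.substring(1) : s;
    }

    public static void main(String[] args) {
        String sample = "A;B;C\nfirst;second;third\n"; // assumed file content
        byte[] plain = sample.getBytes(StandardCharsets.UTF_8);
        byte[] withBom = ("\uFEFF" + sample).getBytes(StandardCharsets.UTF_8);

        String headerPlain = firstLine(plain).split(";")[0];
        String headerWithBom = firstLine(withBom).split(";")[0];

        System.out.println(describe(headerPlain));               // U+0041
        System.out.println(describe(headerWithBom));             // U+FEFF U+0041
        System.out.println(headerWithBom.equals("A"));           // false
        System.out.println(stripBom(headerWithBom).equals("A")); // true
    }
}
```

Passing the column name copied from the annotation to `describe` should reveal the same hidden U+FEFF in front of `A`. `stripBom` only illustrates the idea on a single string; BOMInputStream does the equivalent at the stream level, before the header ever reaches the mapping strategy.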