I was working on a project where we use OpenCSV to read CSV files and populate a database from them at startup. I noticed something strange: in certain cases a given identifier value cannot be queried. During debugging I found that OpenCSV does not read the CSV correctly.
Let's say that I have the following CSV file:
01;foo
02;bar
...
The first line in the example is the first line in the real CSV file as well. The file is encoded in UTF-8. The following code is used to read in the value:
try (CSVReader csvReader = CSVUtils.createCSVReader(masterDataCSVPath, csvDelimiter)) {
    List<String[]> masterData = csvReader.readAll();
}
The code creating the csvReader:
static private CSVParser createCSVParser(String CSVDelimiter) {
    return new CSVParserBuilder().withSeparator(CSVDelimiter.charAt(0)).build();
}

static public CSVReader createCSVReader(String CSVPath, String CSVDelimiter) throws FileNotFoundException {
    return new CSVReaderBuilder(new FileReader(CSVPath)).withCSVParser(createCSVParser(CSVDelimiter)).build();
}
When I read in the CSV file with the code above, the debugger shows that the value read for 01 is not just the two characters 0 and 1: it starts with three extra bytes (0xEF 0xBB 0xBF).
However, if I change my CSV file to the following (notice the extra newline at the top):
01;foo
02;bar
...
The read-in data then contains an extra first entry for the empty line, and 01 is read correctly after it. In this case "all is good": if I remove the first item in my masterData list, I can read in the values properly. However, this is not a clean solution, so I kindly ask for help: how can this be mitigated?
This is not an OpenCSV-specific problem; rather, FileReader reads in the BOM of the UTF-encoded file. This is somewhat unexpected, but it makes sense: FileReader has no way of knowing that it should exclude those bytes.
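To see this for yourself, it is enough to dump the first few bytes of the file. Here is a minimal sketch (masterdata.csv is a placeholder file name); for a UTF-8 file saved with a BOM it prints EF BB BF before the first byte of the actual content:

import java.io.FileInputStream;
import java.io.IOException;

public class BomCheck {
    public static void main(final String[] args) throws IOException {
        // Placeholder path; point this at the real CSV file
        try (FileInputStream in = new FileInputStream("masterdata.csv")) {
            byte[] head = new byte[3];
            int read = in.read(head);
            // A UTF-8 BOM shows up as the three bytes EF BB BF
            for (int i = 0; i < read; i++) {
                System.out.printf("%02X ", head[i] & 0xFF);
            }
            System.out.println();
        }
    }
}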
The solution is either to remove the BOM manually, or - as in my case - to use a library that makes sure it is excluded.
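If you only need the manual route: once the file is decoded as UTF-8, the BOM appears as the single character '\uFEFF' at the start of the first value, so it can be stripped after parsing. A minimal sketch of that idea (the helper class below is purely illustrative, not part of my actual code):

import java.util.List;

final class BomStripper {
    // Removes a decoded UTF-8 BOM ('\uFEFF') from the first cell of the first row, if present.
    static void stripLeadingBom(final List<String[]> rows) {
        if (!rows.isEmpty() && rows.get(0).length > 0 && rows.get(0)[0].startsWith("\uFEFF")) {
            rows.get(0)[0] = rows.get(0)[0].substring(1);
        }
    }
}

In my case I preferred to keep the BOM out of the data in the first place, so I used Apache Commons IO and wrote the following utility class: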
public class CSVUtils {

    private static CSVParser createCSVParser(final String CSVDelimiter) {
        return new CSVParserBuilder().withSeparator(CSVDelimiter.charAt(0)).build();
    }

    // Wraps the given stream so that any leading BOM (UTF-8, UTF-16 or UTF-32) is skipped
    private static BOMInputStream versatileBOMInputStreamGenerator(final InputStream inputStream) {
        return new BOMInputStream(inputStream, ByteOrderMark.UTF_8,
                ByteOrderMark.UTF_16BE, ByteOrderMark.UTF_16LE,
                ByteOrderMark.UTF_32BE, ByteOrderMark.UTF_32LE);
    }

    // Reads a file through a BOM-aware stream, decoded as UTF-8
    public static CSVReader createCSVReaderFromFile(final String CSVPath, final String CSVDelimiter) throws FileNotFoundException {
        return new CSVReaderBuilder(new InputStreamReader(
                versatileBOMInputStreamGenerator(new FileInputStream(CSVPath)), StandardCharsets.UTF_8))
                .withCSVParser(createCSVParser(CSVDelimiter)).build();
    }

    // Same, but for CSV content that is already in memory as a String
    public static CSVReader createCSVReaderFromString(final String content, final String CSVDelimiter) {
        byte[] contentBytes = content.getBytes(StandardCharsets.UTF_8);
        return new CSVReaderBuilder(new InputStreamReader(
                versatileBOMInputStreamGenerator(new ByteArrayInputStream(contentBytes)), StandardCharsets.UTF_8))
                .withCSVParser(createCSVParser(CSVDelimiter)).build();
    }
}
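Reading the file then looks the same as before, only through the new factory method; a short sketch using the same placeholders as in the question:

try (CSVReader csvReader = CSVUtils.createCSVReaderFromFile(masterDataCSVPath, csvDelimiter)) {
    List<String[]> masterData = csvReader.readAll();
    // masterData.get(0)[0] is now "01", with no BOM bytes in front of it
}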
All I have to do is use the CSVReader objects created this way wherever they are needed. As you can see, the class uses some dependencies, which can be imported with
import org.apache.commons.io.ByteOrderMark;
import org.apache.commons.io.input.BOMInputStream;
These dependencies can be added to the project via the POM as follows:
<!-- https://mvnrepository.com/artifact/commons-io/commons-io -->
<dependency>
    <groupId>commons-io</groupId>
    <artifactId>commons-io</artifactId>
    <version>2.11.0</version>
</dependency>