I was working on a project where we use OpenCSV to read CSV files and populate a database from them at startup. I noticed something strange: in certain cases a given identifier value cannot be queried. During debugging I found that OpenCSV does not read the CSV correctly.
Let's say that I have the following CSV file:
01;foo
02;bar
...
The first line in the example is the first line in the real CSV file as well. The file is encoded in UTF-8. The following code is used to read in the value:
try (CSVReader csvReader = CSVUtils.createCSVReader(masterDataCSVPath, csvDelimiter)) {
    List<String[]> masterData = csvReader.readAll();
}
The code creating the csvReader:
static private CSVParser createCSVParser(String CSVDelimiter) {
    return new CSVParserBuilder().withSeparator(CSVDelimiter.charAt(0)).build();
}

static public CSVReader createCSVReader(String CSVPath, String CSVDelimiter) throws FileNotFoundException {
    return new CSVReaderBuilder(new FileReader(CSVPath)).withCSVParser(createCSVParser(CSVDelimiter)).build();
}
When I read in the CSV file with the code above, the debugger shows that the value read for 01 is not just the two characters 0 and 1: it starts with three extra bytes (0xEF 0xBB 0xBF).
However, if I change my CSV file to the following (notice the extra newline at the top):
01;foo
02;bar
...
The read-in data then contains an extra first entry for the empty line, and 01 is read correctly after it. In this case "all is good": if I remove the first item in my masterData list, I can read in the values properly. However, this is not a clean solution, so I kindly ask for help: how can this be mitigated?
This is not an OpenCSV-specific problem; rather, FileReader reads in the BOM of the UTF-encoded file. This is somewhat unexpected, but it makes sense: FileReader has no way of knowing that it should exclude those bytes.
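To see this for yourself, it is enough to dump the first few bytes of the file. Here is a minimal sketch (masterdata.csv is a placeholder file name); for a UTF-8 file saved with a BOM it prints EF BB BF before the first byte of the actual content:

import java.io.FileInputStream;
import java.io.IOException;

public class BomCheck {
    public static void main(final String[] args) throws IOException {
        // Placeholder path; point this at the real CSV file
        try (FileInputStream in = new FileInputStream("masterdata.csv")) {
            byte[] head = new byte[3];
            int read = in.read(head);
            // A UTF-8 BOM shows up as the three bytes EF BB BF
            for (int i = 0; i < read; i++) {
                System.out.printf("%02X ", head[i] & 0xFF);
            }
            System.out.println();
        }
    }
}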
The solution is either to remove the BOM manually, or - as in my case - to use a library that makes sure it is excluded.
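If you only need the manual route: once the file is decoded as UTF-8, the BOM appears as the single character '\uFEFF' at the start of the first value, so it can be stripped after parsing. A minimal sketch of that idea (the helper class below is purely illustrative, not part of my actual code):

import java.util.List;

final class BomStripper {
    // Removes a decoded UTF-8 BOM ('\uFEFF') from the first cell of the first row, if present.
    static void stripLeadingBom(final List<String[]> rows) {
        if (!rows.isEmpty() && rows.get(0).length > 0 && rows.get(0)[0].startsWith("\uFEFF")) {
            rows.get(0)[0] = rows.get(0)[0].substring(1);
        }
    }
}

In my case I preferred to keep the BOM out of the data in the first place, so I used Apache Commons IO and wrote the following utility class: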
public class CSVUtils {

    private static CSVParser createCSVParser(final String CSVDelimiter) {
        return new CSVParserBuilder().withSeparator(CSVDelimiter.charAt(0)).build();
    }

    // Wraps the given stream so that any leading BOM (UTF-8, UTF-16 or UTF-32) is skipped
    private static BOMInputStream versatileBOMInputStreamGenerator(final InputStream inputStream) {
        return new BOMInputStream(inputStream, ByteOrderMark.UTF_8,
                ByteOrderMark.UTF_16BE, ByteOrderMark.UTF_16LE,
                ByteOrderMark.UTF_32BE, ByteOrderMark.UTF_32LE);
    }

    // Reads a file through a BOM-aware stream, decoded as UTF-8
    public static CSVReader createCSVReaderFromFile(final String CSVPath, final String CSVDelimiter) throws FileNotFoundException {
        return new CSVReaderBuilder(new InputStreamReader(
                versatileBOMInputStreamGenerator(new FileInputStream(CSVPath)), StandardCharsets.UTF_8))
                .withCSVParser(createCSVParser(CSVDelimiter)).build();
    }

    // Same, but for CSV content that is already in memory as a String
    public static CSVReader createCSVReaderFromString(final String content, final String CSVDelimiter) {
        byte[] contentBytes = content.getBytes(StandardCharsets.UTF_8);
        return new CSVReaderBuilder(new InputStreamReader(
                versatileBOMInputStreamGenerator(new ByteArrayInputStream(contentBytes)), StandardCharsets.UTF_8))
                .withCSVParser(createCSVParser(CSVDelimiter)).build();
    }
}
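Reading the file then looks the same as before, only through the new factory method; a short sketch using the same placeholders as in the question:

try (CSVReader csvReader = CSVUtils.createCSVReaderFromFile(masterDataCSVPath, csvDelimiter)) {
    List<String[]> masterData = csvReader.readAll();
    // masterData.get(0)[0] is now "01", with no BOM bytes in front of it
}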
All I have to do is use the CSVReader objects created this way wherever they are needed. As you can see, the class uses some dependencies, which can be imported with
import org.apache.commons.io.ByteOrderMark;
import org.apache.commons.io.input.BOMInputStream;
These dependencies can be added to the project via the POM as follows:
<!-- https://mvnrepository.com/artifact/commons-io/commons-io -->
<dependency>
    <groupId>commons-io</groupId>
    <artifactId>commons-io</artifactId>
    <version>2.11.0</version>
</dependency>