Java CSVParser gets empty after reading it


In the snippet below I try to read a CSV file using the CSVParser from the Apache Commons CSV library. Why does the second call to records.getRecords() return an empty list? And how could I have known about this behavior in advance?

import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

public class ReadCSV {

    public ReadCSV() {
    }

    /* Define headers as enum */
    enum HEADER {
        ID, NAME, AGE
    }

    public List<List<String>> ReadCSVToList(String csvPath) throws IOException, HighBalanceException {
        List<List<String>> csvList = new ArrayList<>();
        try {
            Reader reader = new FileReader(csvPath);
            CSVParser records = CSVFormat.DEFAULT.withHeader(HEADER.class).parse(reader);
            List<CSVRecord> records1 = records.getRecords();
            System.out.println(records1.size()); // 2
            List<CSVRecord> records2 = records.getRecords();
            System.out.println(records2.size()); // 0

Solution

  • It helps to read the documentation of CSVParser:

    Parses CSV files according to the specified format. [...] The parser works record wise. It is not possible to go back, once a record has been parsed from the input stream.

    And a few paragraphs later, under the heading "Parsing into memory":

    If parsing record wise is not desired, the contents of the input can be read completely into memory.

    Reader in = new StringReader("a;b\nc;d");
    CSVParser parser = new CSVParser(in, CSVFormat.EXCEL);
    List<CSVRecord> list = parser.getRecords();
    

    There are two constraints that have to be kept in mind:

    1. Parsing into memory starts at the current position of the parser. If you have already parsed records from the input, those records will not end up in the in memory representation of your CSV data.
    2. Parsing into memory may consume a lot of system resources depending on the input. For example if you're parsing a 150MB file of CSV data the contents will be read completely into memory.

    The first call to records.getRecords() reads the rest of the CSV file into memory and leaves the parser at the end of the input. Because "parsing into memory starts at the current position of the parser", the second call has no records left to parse and returns an empty list. If you need the records more than once, call getRecords() a single time and reuse the list it returns.
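The forward-only behavior described above can be reproduced with the JDK alone. The sketch below is not CSVParser itself: `readRemaining` is a hypothetical stand-in for `getRecords()`, assuming only that it reads from the reader's current position to the end of the input, and the class name and sample data are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class SinglePassDemo {

    /**
     * Hypothetical stand-in for CSVParser.getRecords(): collects every line
     * from the reader's current position to the end of the input.
     */
    private static List<String> readRemaining(BufferedReader in) throws IOException {
        List<String> lines = new ArrayList<>();
        String line;
        while ((line = in.readLine()) != null) {
            lines.add(line);
        }
        return lines;
    }

    /** Returns the sizes of two consecutive "read everything" calls on one reader. */
    public static int[] sizesOfTwoReads(String data) {
        try (BufferedReader in = new BufferedReader(new StringReader(data))) {
            int first = readRemaining(in).size();
            // The reader is now at the end of the input, so nothing is left.
            int second = readRemaining(in).size();
            return new int[] {first, second};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Consumes one record first, then reads the rest into memory. */
    public static List<String> readAfterSkippingOne(String data) {
        try (BufferedReader in = new BufferedReader(new StringReader(data))) {
            in.readLine(); // this record never reaches the in-memory list
            return readRemaining(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String data = "1,Alice,30\n2,Bob,25";
        int[] sizes = sizesOfTwoReads(data);
        System.out.println(sizes[0]); // 2
        System.out.println(sizes[1]); // 0
        System.out.println(readAfterSkippingOne(data)); // [2,Bob,25]
    }
}
```

With the real CSVParser the same pattern applies: keep the list returned by the first getRecords() call (records1 in the question) and iterate over it as often as needed. If a second pass over the file is really required, open a new Reader and create a new parser.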