
Parsing CSV in groovy with exception tolerance


I've been trying to parse a CSV file in Groovy, currently using the library org.apache.commons.csv 2.4. The requirement I have is that there are invalid data values in CSV cells, such as invalid characters, and instead of throwing an exception on the first invalid row/cell, I want to collect these exceptions and keep iterating through the CSV file until the end, so that I end up with a full list of the invalid data the file contains.

With that purpose in mind, I've tried multiple ways of using this Apache library, but unfortunately, as long as iteration goes through CSVParser.getNextRecord(), the iterator simply aborts on the first invalid record.

Put into code, it looks something like this:

    def records = new CSVParser(reader, CSVFormat.EXCEL.withHeader().withIgnoreSurroundingSpaces())

    // At this line, the iterator() inside CSVParser always uses getNextRecord() for its
    // next() implementation, and it may throw an exception on an invalid character.
    records.each { record ->
        try {
            // process the record
        } catch (e) {
            // I want to collect errors here, but the exception is thrown from .each
            // itself, which makes this try/catch in vain.
        }
    }

So, is there anything else I should dig into in this library? Or could anybody point me to another, more viable solution? Many thanks to all!

Update: Sample CSV

"Company code for WBS element","WBS Element","PS: Short description (1st text line)","Responsible Cost Center for WBS Element","OBJNR","WBS Status"

"1001","RE-01768-011","Opex - To present a paper on Career con","0000016400","PR00031497","X"
"1001","RE-01768-011","Opex - To present a paper on "Career con","0000016400","PR00031497","X"

The second data row contains a stray quote character (") that makes the parser throw an exception.
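
For reference, here is roughly how the failure can be reproduced with just the bad row (a minimal sketch; the Grab coordinates are an assumption, and the exact exception type and message may vary with the library version):

    @Grab(group='org.apache.commons', module='commons-csv', version='1.2')
    import org.apache.commons.csv.*

    def badRow = '"1001","RE-01768-011","Opex - To present a paper on "Career con","0000016400","PR00031497","X"'

    try {
        // getRecords() drives the lexer, which rejects the stray quote inside the quoted field
        CSVParser.parse(badRow, CSVFormat.EXCEL).records.each { println it }
    } catch (Exception e) {
        println e   // e.g. "invalid char between encapsulated token and delimiter"
    }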


Solution

  • The problem you have is that one of the characters in one cell is the quote character used by the parser, according to the selected format: CSVFormat.EXCEL.

    The quote character is

    the character used to encapsulate values containing special characters

    so in your example the quote is misused and the parser complains about it.

    You can work around that by using a different CSVFormat, for example one without a quote character:

    @Grapes(
        @Grab(group='org.apache.commons', module='commons-csv', version='1.2')
    )
    
    import java.nio.charset.*
    import org.apache.commons.csv.*
    
    def text = '''"Company code for WBS element","WBS Element","PS: Short description (1st text line)","Responsible Cost Center for WBS Element","OBJNR","WBS Status"
    
    "1001","RE-01768-011","Opex - To present a paper on Career con","0000016400","PR00031497","X"
    "1002","RE-01768-011","Opex - To present a paper on "Career con","0000016400","PR00031497","X"
    "1003","RE-01768-011","Opex - To present a paper on Career con","0000016400","PR00031497","X"'''
    
    // withQuote(null) disables quote handling, so the stray quote no longer aborts the parser
    def parsed = CSVParser.parse(text, CSVFormat.EXCEL.withHeader().withIgnoreSurroundingSpaces().withQuote(null))
    
    parsed.getRecords().each {
        println it.toMap().values()
    }
    

    And the above yields:

    []
    ["0000016400", "1001", "RE-01768-011", "Opex - To present a paper on Career con", "X", "PR00031497"]
    ["0000016400", "1002", "RE-01768-011", "Opex - To present a paper on "Career con", "X", "PR00031497"]
    ["0000016400", "1003", "RE-01768-011", "Opex - To present a paper on Career con", "X", "PR00031497"]
    

    Of course, with the above workaround, you have the quotes (") included in each field.

    You can replace them all if you want:

    parsed.getRecords().each {
        println it.toMap().values().collect({ it.replace('"', '') })
    }
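
    If you need to keep the EXCEL quoting rules and still collect every invalid row instead of aborting on the first one, another option is to parse the file line by line and catch failures per line. This is just a sketch, assuming every record fits on a single line (no embedded newlines in quoted fields); the file name data.csv is a placeholder:

    @Grab(group='org.apache.commons', module='commons-csv', version='1.2')
    import org.apache.commons.csv.*

    def lines = new File('data.csv').readLines()   // placeholder input file
    def errors = []
    def rows = []

    // Skip the header line, ignore blank lines, and parse each data line on its own
    // so that one bad line cannot abort the rest of the file.
    lines.drop(1).findAll { it.trim() }.eachWithIndex { line, i ->
        try {
            rows.addAll(CSVParser.parse(line, CSVFormat.EXCEL).records)
        } catch (Exception e) {
            errors << "data row ${i + 1}: ${e.message}"   // i indexes the non-blank data rows
        }
    }

    println "parsed ${rows.size()} rows, collected ${errors.size()} errors"
    errors.each { println it }

    Note that the records here are positional (no withHeader() on the per-line parse), so re-attach the header mapping afterwards if you need toMap().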