Search code examples
javaoracle11gunivocity

Univocity - writing out surrounding quotes even if field does not contain delimiter char


I have a file unloaded from a database in such a way that all varchar columns are surrounded by quotes, regardless of the actual content of the column (unfortunately the unload proces is out of my control).

Like this:

1,"Alex ,/,awesome/,","chan"
2,"Peter ,boring","pitt"

When using the following code with univocity 2.2.3 in the pom:

public class Sample {   

    public static void main(String[] args) throws IOException {
        BeanListProcessor<Person> rowProcessor = new BeanListProcessor<Person>(Person.class);
        CsvParserSettings parserSettings = new CsvParserSettings();
        parserSettings.setProcessor(rowProcessor);  
        parserSettings.getFormat().setDelimiter(',');
        parserSettings.getFormat().setQuote('"');
        parserSettings.getFormat().setQuoteEscape('/');     
        CsvParser parser = new CsvParser(parserSettings);
        parser.parse(new FileReader("src/main/resources/person.csv"));
        List<Person> beans = rowProcessor.getBeans();

        Writer outputWriter = new FileWriter("src/main/resources/personOut.csv", true);
        CsvWriterSettings settings = new CsvWriterSettings();
        settings.getFormat().setDelimiter(',');
        settings.getFormat().setQuote('"');
        settings.getFormat().setQuoteEscape('/');
        settings.getFormat().setCharToEscapeQuoteEscaping('\0');
        settings.setRowWriterProcessor(new BeanWriterProcessor<Person>(Person.class));      
        CsvWriter writer = new CsvWriter(outputWriter, settings);
        for (Person person : beans) {
            writer.processRecord(person);
        }
        writer.close();
    }
}

Only the columns containing the delimiter are surrounded by quotes:

1,"Alex ,/,awesome/,",chan
2,"Peter ,boring",pitt

When using settings.setQuoteAllFields(true); on the writer setting, all the fields get surrounded by quotes, but now the non varchar fields are in trouble.

How do I surround only the columns that are surrounded by quotes from the source with quotes regardless of the content of the column (e.g. delimiter is or is not present)?

Desired result:

1,"Alex ,/,awesome/,","chan"
2,"Peter ,boring","pitt"

Solution

  • The CSV writer doesn't provide an explicit mechanism to configure this, but you can do the following:

    Parse with this:

        parserSettings.setKeepQuotes(true);
        parserSettings.setKeepEscapeSequences(true);
    

    These two settings will effectively work as a "split" operation over your input CSV - you will get the entire content between delimiters. Using your sample input, the values will be parsed as:

    1 | "Alex ,/,awesome/," | chan | 
    2 | "Peter boring" | pitt | 
    

    I'm using pipes to separate the values above to make it easier to visualize what comes out.

    Now, the hacky bit, I can't guarantee this will work with future versions of the library as it uses internal API's: the CsvWriter has a processRow method which you can override. As your input values are coming properly formatted as you want them to be, you can dump them out "as-is" by just joining the values of each row with commas. Just do the following:

    CsvWriter writer = new CsvWriter(outputWriter, settings){
        @Override
        protected void processRow(Object[] row) {
            for(int i = 0; i < row.length; i++){
                Object value = row[i];
                appender.append(value.toString());
                if(i + 1 < row.length) { //not the last column
                    appender.append(',');
                }
                appendValueToRow();
            }
        }
    };
    

    This will produce the output you expect, but I'm not sure if it's very useful because you simply depend on the input to be properly formatted and making changes over it will complicate things quite a bit.

    The appropriate thing to do here is to add an additional configuration option to the library that would allow you to configure whether to quote a given column or not.