I am using univocity-parsers version 2.7.3. I have a CSV file with 1 million records that might grow in the future. I am reading only a few specific columns from the file, and below are my requirements:
DO NOT store the CSV contents in memory at any point
Ignore/skip bean creation if either the latitude or longitude column in the CSV is null/blank
To meet these requirements, I implemented CsvRoutines so that the CSV data is not copied into memory. I am using the @Validate annotation on both the "Latitude" and "Longitude" fields, and have used an error handler that does not rethrow the exception, so that the record is skipped on validation failure.
Sample CSV:
#version:1.0
#timestamp:2017-05-29T23:22:22.320Z
#brand:test report
network_name,location_name,location_category,location_address,location_zipcode,location_phone_number,location_latitude,location_longitude,location_city,location_state_name,location_state_abbreviation,location_country,location_country_code,pricing_type,wep_key
"1 Free WiFi","Test Restaurant","Cafe / Restaurant","Marktplatz 18","1233","+41 263 34 05","1212.15","7.51","Basel","test","BE","India","DE","premium",""
"2 Free WiFi","Test Restaurant","Cafe / Restaurant","Zufikerstrasse 1","1111","+41 631 60 00","11.354","8.12","Bremgarten","test","AG","China","CH","premium",""
"3 Free WiFi","Test Restaurant","Cafe / Restaurant","Chemin de la Fontaine 10","1260","+41 22 361 69","12.34","11.23","Nyon","Vaud","VD","Switzerland","CH","premium",""
"!.oist*~","HoistGroup Office","Office","Chemin de I Etang","CH-1211","","","","test","test","GE","Switzerland","CH","premium",""
"test","tess's Takashiro","Cafe / Restaurant","Test 1-10","870-01","097-55-1808","","","Oita","Oita","OITA","Japan","JP","premium","1234B"
TestDTO.java
import com.fasterxml.jackson.annotation.JsonIgnoreProperties;
import com.univocity.parsers.annotations.Parsed;
import com.univocity.parsers.annotations.Validate;

import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;

import java.io.Serializable;

@Data
@NoArgsConstructor
@AllArgsConstructor
@JsonIgnoreProperties(ignoreUnknown = true)
public class TestDTO implements Serializable {

    @Parsed(field = "location_name")
    private String name;

    @Parsed(field = "location_address")
    private String addressLine1;

    @Parsed(field = "location_city")
    private String city;

    @Parsed(field = "location_state_abbreviation")
    private String state;

    @Parsed(field = "location_country_code")
    private String country;

    @Parsed(field = "location_zipcode")
    private String postalCode;

    @Parsed(field = "location_latitude")
    @Validate
    private Double latitude;

    @Parsed(field = "location_longitude")
    @Validate
    private Double longitude;

    @Parsed(field = "network_name")
    private String ssid;
}
Main.java
CsvParserSettings parserSettings = new CsvParserSettings();
parserSettings.detectFormatAutomatically();
parserSettings.setLineSeparatorDetectionEnabled(true);
parserSettings.setHeaderExtractionEnabled(true);
parserSettings.setSkipEmptyLines(true);
parserSettings.selectFields("network_name", "location_name", "location_address", "location_zipcode",
        "location_latitude", "location_longitude", "location_city", "location_state_abbreviation",
        "location_country_code");
parserSettings.setProcessorErrorHandler(new RowProcessorErrorHandler() {
    @Override
    public void handleError(DataProcessingException error, Object[] inputRow, ParsingContext context) {
        // do nothing, so the offending record is simply skipped
    }
});

CsvRoutines parser = new CsvRoutines(parserSettings);
ResultIterator<TestDTO, ParsingContext> iterator =
        parser.iterate(TestDTO.class, new FileReader("c:\\users\\...\\test.csv")).iterator();

int i = 0;
while (iterator.hasNext()) {
    TestDTO dto = iterator.next();
    if (dto.getLongitude() == null || dto.getLatitude() == null) {
        i++;
    }
}
System.out.println("count==" + i);
Problem:
I actually expected the count to be zero, since I added the error handler and did not rethrow the data validation exception, but that does not seem to be the case. My understanding was that @Validate would throw an exception when it encounters a record with either latitude or longitude as null (both columns may be null in the same record as well), and that the exception would then be handled and the record ignored/skipped by the error handler.
Basically, I do not want univocity to create and map unnecessary DTO objects on the heap (leading to an OutOfMemoryError), since there is a chance that the incoming CSV file has more than 200k or 300k records with latitude/longitude as null.
I even tried adding a custom validator in @Validate as well, but in vain.
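For reference, the custom validator attempt looked roughly like this. This is only a sketch: NotNullValidator is an illustrative name, and I am assuming univocity's Validator interface (com.univocity.parsers.conversions.Validator) together with the validators attribute of @Validate, where a non-null return value flags the value as invalid:

// illustrative sketch of a custom validator, assuming the
// com.univocity.parsers.conversions.Validator interface
public class NotNullValidator implements com.univocity.parsers.conversions.Validator<Object> {
    @Override
    public String validate(Object value) {
        // assumption: returning a non-null message marks the value as invalid
        if (value == null || String.valueOf(value).trim().isEmpty()) {
            return "latitude/longitude must not be null or blank";
        }
        return null; // null means the value is valid
    }
}

// wired into the DTO field (illustrative):
@Parsed(field = "location_latitude")
@Validate(validators = NotNullValidator.class)
private Double latitude;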
Could someone please let me know what I am missing here?
Author of the library here. You are doing everything right. This is a bug, and I just opened an issue to have it resolved today.
The bug appears when you select fields: the reordering of values makes the validation run against something else (in my test, it validated the city instead of the latitude).
In your case, just add the following line of code and it will work fine:
parserSettings.setColumnReorderingEnabled(false);
This makes the parser generate rows with nulls in the positions of fields that were not selected, instead of removing those positions and reordering the values in the parsed row. It avoids the bug and will also make your program run slightly faster.
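To illustrate the difference, here is a minimal sketch (the inline input and the class name are just for demonstration):

import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;

import java.io.StringReader;
import java.util.Arrays;
import java.util.List;

public class ReorderingDemo {
    public static void main(String[] args) {
        String input = "a,b,c\n1,2,3";

        CsvParserSettings settings = new CsvParserSettings();
        settings.getFormat().setLineSeparator("\n");
        settings.setHeaderExtractionEnabled(true);
        settings.selectFields("a", "c");

        // Default behavior: selected values are compacted and reordered: [1, 3]
        List<String[]> rows = new CsvParser(settings).parseAll(new StringReader(input));
        System.out.println(Arrays.toString(rows.get(0)));

        // Reordering disabled: the row keeps its original shape, with null
        // in the position of the unselected column: [1, null, 3]
        settings.setColumnReorderingEnabled(false);
        rows = new CsvParser(settings).parseAll(new StringReader(input));
        System.out.println(Arrays.toString(rows.get(0)));
    }
}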
You will also need to test for null in the iteration bit:
while (iterator.hasNext()) {
    TestDTO dto = iterator.next();
    if (dto != null) { // dto may come back null here due to validation
        if (dto.getLongitude() == null || dto.getLatitude() == null) {
            i++;
        }
    }
}
Hope this helps and thank you for using our parsers!