In one part of my application, I parse a 17MB log file into a list structure, one LogEntry per line. There are approximately 100K lines/log entries, i.e. approx. 170 bytes per line. What surprised me is that I run out of heap space even when I specify 128MB (256MB seems sufficient). How can 17MB of text turned into a list of objects cause a roughly tenfold increase in space?
I understand that String objects use at least twice the amount of space compared to ANSI text (Unicode, one char=2 bytes), but this consumes at least four times that.
What I am looking for is an approximation of how much memory an ArrayList of n LogEntries will consume, or an explanation of how my method might create extraneous objects that aggravate the situation (see the comment below on String.trim()).
This is the data part of my LogEntry class:

    public class LogEntry {
        private Long id;
        private String system, version, environment, hostName, userId, clientIP, wsdlName, methodName;
        private Date timestamp;
        private Long milliSeconds;
        private Map<String, String> otherProperties;
        // ... (rest of the class omitted)
    }

This is the part doing the reading:

    public List<LogEntry> readLogEntriesFromFile(File f) throws LogImporterException {
        CSVReader reader;
        final String ISO_8601_DATE_PATTERN = "yyyy-MM-dd HH:mm:ss,SSS";
        List<LogEntry> logEntries = new ArrayList<LogEntry>();
        String[] tmp;
        try {
            int lineNumber = 0;
            final char DELIM = ';';
            reader = new CSVReader(new InputStreamReader(new FileInputStream(f)), DELIM);
            while ((tmp = reader.readNext()) != null) {
                lineNumber++;
                if (tmp.length < LogEntry.getRequiredNumberOfAttributes()) {
                    String tmpString = concat(tmp);
                    if (tmpString.trim().isEmpty()) {
                        logger.debug("Empty string");
                    } else {
                        logger.error(String.format(
                            "Invalid log format in %s:L%s. Not enough attributes (%d/%d). Was %s . Continuing ...",
                            f.getAbsolutePath(), lineNumber, tmp.length, LogEntry.getRequiredNumberOfAttributes(), tmpString)
                        );
                    }
                    continue;
                }
                List<String> values = new ArrayList<String>(Arrays.asList(tmp));
                String system, version, environment, hostName, userId, wsdlName, methodName;
                Date timestamp;
                Long milliSeconds;
                Map<String, String> otherProperties;
                system = values.remove(0);
                version = values.remove(0);
                environment = values.remove(0);
                hostName = values.remove(0);
                userId = values.remove(0);
                String clientIP = values.remove(0);
                wsdlName = cleanLogString(values.remove(0));
                methodName = cleanLogString(stripNormalPrefixes(values.remove(0)));
                timestamp = new SimpleDateFormat(ISO_8601_DATE_PATTERN).parse(values.remove(0));
                milliSeconds = Long.parseLong(values.remove(0));
                /* remaining properties are the key-value pairs */
                otherProperties = parseOtherProperties(values);
                logEntries.add(new LogEntry(system, version, environment, hostName, userId, clientIP,
                        wsdlName, methodName, timestamp, milliSeconds, otherProperties));
            }
            reader.close();
        } catch (IOException e) {
            throw new LogImporterException("Error reading log file: " + e.getMessage());
        } catch (ParseException e) {
            throw new LogImporterException("Error parsing logfile: " + e.getMessage(), e);
        }
        return logEntries;
    }

Utility function used for populating the map:

    private Map<String, String> parseOtherProperties(List<String> values) throws ParseException {
        HashMap<String, String> map = new HashMap<String, String>();
        String[] tmp;
        for (String s : values) {
            if (s.trim().isEmpty()) {
                continue;
            }
            tmp = s.split(":");
            if (tmp.length != 2) {
                throw new ParseException("Could not split string into key:value :\"" + s + "\"", s.length());
            }
            map.put(tmp[0], tmp[1]);
        }
        return map;
    }

You also have a Map there, where you store other properties. Your code doesn't show how this Map is populated, but keep in mind that Maps may have a hefty memory overhead compared to the memory needed for the entries themselves.
A Map's footprint is roughly the size of the array that backs it (at least 16 entries * 4 bytes), plus one key/value pair per entry, plus the size of the data itself. Two map entries, each using 10 chars for the key and 10 chars for the value, would consume 16*4 + 2*2*4 + 2*10*2 + 2*10*2 + 2*2*8 = 64+16+40+40+32 = 192 bytes (1 char = 2 bytes, a String object consumes at least 8 bytes). That alone would roughly double the space requirements for the entire log string.
Add to this that the LogEntry contains 12 objects, i.e. at least 96 bytes. Hence the log objects alone would need around 100 bytes, give or take, without the Map and without the actual string data. On top of that come the pointers for the references (4B each). I count at least 18 with the Map, meaning 72 bytes.
Adding the data (excluding the object references and object "headers" mentioned in the last paragraph):
2 longs = 16B, 1 date stored as a long = 8B, the map = 192B. In addition comes the string content, say 90 chars = 180 bytes. Perhaps a byte or two at each end of the list item when it is put in the list, so in total somewhere around 100+72+16+8+192+180 = 568 ~ 600 bytes per log line.
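To make the back-of-the-envelope sum easy to re-run with other assumptions, here is a minimal sketch that evaluates the terms exactly as listed above. The class and method names are made up for illustration, and the constants (4-byte references, 8 bytes minimum per object, 2 bytes per char, a 16-slot backing array) are the rough assumptions from the text, not measured JVM values.

```java
public class LogEntryMemoryEstimate {
    // Rough assumptions taken from the text, not measured values.
    static final long REF_BYTES = 4;     // one object reference
    static final long OBJECT_BYTES = 8;  // minimum size of any object
    static final long CHAR_BYTES = 2;    // one char in a String

    /** Estimated bytes for a map with the given number of entries, keys and values of equal length. */
    public static long mapBytes(int entries, int charsPerString) {
        long backingArray = 16 * REF_BYTES;                 // default table of 16 slots
        long pairReferences = entries * 2 * REF_BYTES;      // one key and one value reference per entry
        long keyChars = entries * charsPerString * CHAR_BYTES;
        long valueChars = entries * charsPerString * CHAR_BYTES;
        long stringObjects = entries * 2 * OBJECT_BYTES;    // one String object per key and per value
        return backingArray + pairReferences + keyChars + valueChars + stringObjects;
    }

    /** Estimated bytes for one LogEntry whose String fields hold stringChars characters in total. */
    public static long perEntryBytes(int stringChars) {
        long objects = 100;                  // 12 objects * 8 bytes = 96, rounded up as in the text
        long references = 18 * REF_BYTES;    // 18 references counted in the text
        long longs = 2 * 8;                  // id and milliSeconds
        long date = 8;                       // timestamp stored as a long
        long map = mapBytes(2, 10);          // two entries, 10-char keys and values
        long strings = stringChars * CHAR_BYTES;
        return objects + references + longs + date + map + strings;
    }

    /** Estimated bytes for a list of the given number of entries. */
    public static long totalBytes(int lines, int stringChars) {
        return lines * perEntryBytes(stringChars);
    }

    public static void main(String[] args) {
        System.out.println("map:       " + mapBytes(2, 10) + " bytes");
        System.out.println("per entry: " + perEntryBytes(90) + " bytes");
        System.out.println("100K rows: " + totalBytes(100000, 90) + " bytes");
    }
}
```

Changing the constants (for example 8-byte references on a 64-bit JVM without compressed pointers) shows how sensitive the total is to these assumptions.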
So around 600 bytes per log line, meaning 100K lines would consume around 60MB at a minimum. That puts it at least in the same order of magnitude as the heap size that was set aside. In addition, tmpString.trim() in a loop might be creating copies of the string, and String.format() may be creating copies as well. The rest of the application must also fit within this heap space, which might explain where the rest of the memory is going.
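On the trim() point: according to the String Javadoc, trim() returns the same instance when there is nothing to strip and a new String object otherwise, so calling it only to test for emptiness can produce throwaway objects. A minimal sketch of a check that allocates nothing is shown below. The helper name isBlank and the class around it are hypothetical and not part of the original code.

```java
public class BlankCheck {
    /**
     * Hypothetical helper: returns true if s is empty or contains only
     * characters that trim() would strip (char codes <= U+0020).
     * Gives the same answer as s.trim().isEmpty() without creating a String.
     */
    public static boolean isBlank(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > ' ') {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isBlank("  \t "));
        System.out.println(isBlank(" a "));
    }
}
```

Temporaries like these are eligible for garbage collection as soon as the loop iteration ends, so a change like this would mainly reduce allocation churn. It would not shrink the memory retained by the list of LogEntry objects.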