I have two sets of HProf dumps, one for a large sample and one for a smaller sample - both come from a very small subset of the huge data set I have. I'm trying to figure out the bottleneck in my approach.
Here is my heap allocation data for the large sample (http://pastebin.com/PEH8yR3v) and the small sample (http://pastebin.com/aR8ywkDH).
I notice that char[] takes most of my memory. Also, the percentage of memory taken by char[] varies between the small and the large sample runs, and I don't know how it will vary when I profile my whole data set.
But the important question I'm concerned with is this: when I run this program (read, parse/process, write) on 3 GB of input data, writing back 10 GB of output, it uses around 7 GB of main memory. Except for one list whose size is no more than 1 GB, I don't store anything in memory - this is a plain read, process, write pipeline.
This is my approach:
    read a file in through a string iterator
    for each line in ip_file:
        op_buffer = myFunction(line)
        write op_buffer to op_file

Perform this for all 20K files in my input data (a rough Scala sketch of this loop follows).
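Concretely, the driver looks roughly like this in Scala (the paths are placeholders and myFunction is stubbed so the sketch is self-contained; the real logic is shown further down):

    import java.io.PrintWriter
    import scala.io.Source

    object Pipeline {
      // Stub so this sketch compiles on its own; the real logic is sketched below.
      def myFunction(line: String): String = line

      def processFile(inputPath: String, outputPath: String): Unit = {
        val source = Source.fromFile(inputPath)
        val writer = new PrintWriter(outputPath)
        try {
          // Stream line by line: only the current line and its result
          // need to be held in memory at any time.
          for (line <- source.getLines()) {
            val opBuffer = myFunction(line)
            writer.println(opBuffer)
          }
        } finally {
          source.close()
          writer.close()
        }
      }

      def main(args: Array[String]): Unit = {
        // Repeated for all ~20K input files (file discovery omitted here).
        processFile("input.txt", "output.txt")
      }
    }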
    def myFunction(line)
    {
        var : String = null
        for each word in line
        {
            var class_obj = new Classname(word)
            op_line += class_obj.result
        }
        return op_line
    }
Since the objects created inside myFunction will go out of scope at the end of myFunction, I don't take care to delete/free them. Do you see any bottlenecks?
"Since the objects created inside myFunction will go out of scope at the end of myFunction"
No, they won't be freed. This is not C++: all objects are created on the heap, nothing is deallocated when a method returns, and an object only goes away once it is unreachable and the garbage collector reclaims it.
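To illustrate with made-up names (not your code): a local object merely becomes eligible for collection once the method returns, while anything reachable from a longer-lived reference is never collected at all.

    object ScopeExample {
      private var retained = ""              // field: anything reachable from here is never collected

      def process(line: String): String = {
        val local = new StringBuilder(line)  // local: eligible for GC only after process returns
        retained += line                     // this, however, keeps growing across calls
        local.toString
      }
    }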
Also, you haven't declared op_line anywhere in your pseudocode, so I assume it is being retained between method calls, and I'd guess that is your memory leak. There is no way you should have a single character array of more than 100 million bytes, which is what the "small" heap dump says you have.
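If that's the case, the fix is to declare the accumulator inside the method, and preferably to use a StringBuilder rather than repeated String concatenation, which allocates a fresh String (and its backing char[]) on every +=. A minimal sketch, assuming Classname exposes a result: String as your pseudocode suggests (the Classname stub here is only a stand-in):

    // Stand-in for the question's Classname, which isn't shown.
    class Classname(word: String) {
      def result: String = word
    }

    object Fixed {
      def myFunction(line: String): String = {
        // Local accumulator: unreachable, and therefore collectible, as soon as this call returns.
        val opLine = new StringBuilder
        for (word <- line.split("\\s+")) {
          val classObj = new Classname(word)
          opLine.append(classObj.result)
        }
        opLine.toString
      }
    }

With the accumulator local, each call's intermediate strings become garbage as soon as the call returns, so the live set should stay close to your 1 GB list plus the line currently being processed.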