I have a Java application that parses PDF files in a directory and its subdirectories and creates a database from the information found in the files.
Everything was fine when I was running the program on around 900 files (which create a SQLite database with multiple tables, some of which contain 150k rows).
Now I'm trying to run my program on a larger data set (around 2000 files) and at some point I get "OutOfMemoryError: Java heap space". I changed the following line in my jdev.conf file:
AddVMOption -XX:MaxPermSize=256M
to 512M and I got the same error (though later, I think). I'm going to change it to something bigger again, but the computers this program will be used on are much older and don't have as much memory. Normally, users are not going to add more than 30 files at a time, but I want to know how many files I should limit them to. Ideally, I'd like my program not to throw an error regardless of how many files are parsed.
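A quick way to check which limit is actually in effect is to print the heap ceiling from inside the program. The sketch below is only an illustration; the class name HeapCheck and its method names are made up. Note that -XX:MaxPermSize sizes the permanent generation (on JVMs before Java 8), while the maximum heap is set with -Xmx, so the number printed here should only move when -Xmx changes.

```java
// Hypothetical helper class, named for illustration only.
class HeapCheck {
    /** Returns the maximum heap the JVM will attempt to use, in megabytes. */
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024 * 1024);
    }

    /** Returns the heap currently in use (total minus free), in megabytes. */
    static long usedHeapMegabytes() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
    }

    public static void main(String[] args) {
        System.out.println("Max heap (MB):  " + maxHeapMegabytes());
        System.out.println("Used heap (MB): " + usedHeapMegabytes());
    }
}
```

Calling usedHeapMegabytes() before and after each file is parsed gives a rough picture of whether memory use keeps climbing as more files are processed.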
At first, I thought it was my SQLite queries that were causing the error, but after reading up on Google, it's probably some recursive function. I isolated it (I think it's the correct one at least), to this function:
public static void visitAllDirsAndFiles(File dir) {
    if (dir.isDirectory())
    {
        String[] children = dir.list();
        for (int i = 0; i < children.length; i++)
        {
            visitAllDirsAndFiles(new File(dir, children[i]));
        }
    }
    else
    {
        try
        {
            BowlingFilesReader.readFile(dir);
        }
        catch (Exception exc)
        {
            exc.printStackTrace();
            System.out.println("Other Exception in file: " + dir);
        }
    }
}
I think the problem might be that it recursively calls this function for each subsequent directory, but I'm really not sure that could be the problem. What do you think? If it might be, how can I make it so I don't get this error again? If you think it is impossible that this section alone causes the problem, I'll try to find which other part of the program can cause it.
The only other thing I can see causing it is that I connect to the database before calling the above method and disconnect after it returns. I do that because if I connect and disconnect for each file, my program takes a lot longer to parse the data, so I'd really like not to have to change that.
If recursion were the cause of the problem, you would get a stack-related error (a StackOverflowError) rather than a heap error. It seems that you have some kind of memory leak in BowlingFilesReader.
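To illustrate the difference, here is a minimal sketch (the class and method names are made up for the example). Runaway recursion exhausts the thread's stack and raises StackOverflowError, whereas "Java heap space" means too many objects are being kept reachable on the heap. In your method the recursion depth only grows with how deeply the directories are nested, not with the number of files.

```java
// Hypothetical demo class, not part of the asker's program.
class StackVersusHeap {
    /** Recurses without a base case until the thread's stack is exhausted. */
    static int recurseForever(int depth) {
        return recurseForever(depth + 1) + 1;
    }

    /** Returns the simple name of the error thrown by runaway recursion. */
    static String errorFromDeepRecursion() {
        try {
            recurseForever(0);
            return "none";
        } catch (StackOverflowError e) {
            // Stack exhaustion is reported as StackOverflowError,
            // not as OutOfMemoryError: Java heap space.
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(errorFromDeepRecursion()); // prints "StackOverflowError"
    }
}
```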
...