Tags: java, hashmap, inputstream, bufferedreader, fileinputstream

Java - Reading a file and loading in HashMap - How to reduce time?


I'm reading files of around 20 MB, each with around 500,000 records. I'm loading the records into a HashMap, with one particular field as the key and another field as the value. The map's key-value pairs are used in subsequent processing.

The time to simply read the file is negligible. But parsing the fields and loading them into the HashMap seems to take hours. The code looks somewhat like this:

InputStream in = new FileInputStream(new File(file));
br = new BufferedReader(new InputStreamReader(in), 102400);
if (br != null) {
    for (String record; (record = br.readLine()) != null;) {
        sb = new StringBuilder(record);

        map.put(sb.substring(findStartIndex(fieldName1),findEndIndex(fieldName1)), sb.substring(findStartIndex(fieldName2),findEndIndex(fieldName2)));

    }
}

where findStartIndex() and findEndIndex() are methods that parse a record-format XML and find the start and end indexes of the field.

I need to repeat this process for a bunch of files. Please suggest some way to reduce the runtime. Any help is appreciated. Thanks.

Edit: I implemented findStartIndex and findEndIndex as below.

The input is XML with field names and index values. I used a SAX parser, with getters and setters for each field, to find the start and end values.


Solution

  • You can read millions of lines a second with a BufferedReader, so the time is undoubtedly going into your unshown XML parsing. It seems you aren't using a proper parser. Apparently you are either doing string searches on the XML, starting from the beginning of the string both times, which is quadratic, or else parsing each line as XML four times, which is worse. Don't do that. Use XPath to find your fields, which is a lot quicker, or a properly implemented SAX parser listener.

    And I don't see any good reason for creating a new StringBuilder for each line when you already have the line itself as a String, which has its own substring() method.

    NB: br cannot possibly be null at the point where you test it. A new expression either returns a non-null reference or throws an exception.
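
As a rough illustration of the points above, here is a minimal sketch that assumes the start and end indexes depend only on the field name (as the question's edit suggests) and so can be resolved once before the loop rather than four times per record. The class FixedWidthLoader, the FieldRange holder, and the sample data are hypothetical names introduced for this example. How the offsets are obtained from the record-format XML is left out, and the check that skips short lines is an added safeguard, not something from the question.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

class FixedWidthLoader {

    /** Hypothetical holder for one field's offsets, resolved once from the record-format XML. */
    static final class FieldRange {
        final int start; // inclusive
        final int end;   // exclusive

        FieldRange(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Reads every line and stores key field -> value field.
     * The offsets are passed in, so no XML work happens inside the loop.
     */
    static Map<String, String> load(BufferedReader reader, FieldRange key, FieldRange value)
            throws IOException {
        Map<String, String> map = new HashMap<>();
        int minLength = Math.max(key.end, value.end);
        String record;
        while ((record = reader.readLine()) != null) {
            if (record.length() < minLength) {
                continue; // assumption: lines too short to hold both fields are skipped
            }
            // String.substring is used directly; no StringBuilder copy is needed.
            map.put(record.substring(key.start, key.end),
                    record.substring(value.start, value.end));
        }
        return map;
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample data: a 5-character id followed by a 5-character name.
        String data = "ID001Alice\nID002Bob  \n";
        FieldRange idField = new FieldRange(0, 5);    // in real code, looked up once from the XML
        FieldRange nameField = new FieldRange(5, 10); // likewise
        try (BufferedReader reader = new BufferedReader(new StringReader(data))) {
            Map<String, String> map = load(reader, idField, nameField);
            System.out.println(map.get("ID001"));
        }
    }
}
```

With a real file, the reader would wrap a FileInputStream as in the question. If several files share the same record format, the two FieldRange values could be reused across all of them, so the XML would be parsed only once overall.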