Search code examples
javadata-structuressuffix-tree

Java Suffix Trie exceeding heap space


I am implementing a suffix trie (this is different from a suffix tree) that stores the characters suffixes of strings as nodes in a tree structure where a string is made up by following traversing the tree until you hit a '$' or you hit the end of your search.

The problem is that constructing this trie consumes more memory than Java has when using a large text file. Are there any placed that I could cut down on memory use in terms of data structures? This is homework and it isn't a requirement to make it a compressed suffix trie (which is basically a suffix tree).

This is the basic structure that I currently have (I can provide the implementation details if you really want):

// SuffixTrie.java

public class SuffixTrie {
    private SuffixTrieNode root = new SuffixTrieNode();

    // implementation of insertions into tree etc..


    public static void main(String[] args) throws FileNotFoundException {   
        String fileName = "Frankenstein.txt";
        SuffixTrie st = readInFromFile(fileName);
        String[] ss = {"without","hideous", "the only", "onster", ", the", "ngeuhhh"};
        for (String s: ss) {
            SuffixTrieNode sn = st.get(s);
            System.out.println("[" + s + "]: " + sn);
        }
    }
}

Each node is:

// SuffixTrieNode.java
public class SuffixTrieNode {
    private char label; // Indicates the letter for this node
    private boolean isTerminal = false;
    private SuffixTrieData data;
    private HashSet<SuffixTrieNode> children; 
 // Inserting adds more SuffixTrieNodes to the children of the node

The data held in each node is:

public class SuffixTrieData {
    private ArrayList<Pair> startIndexes = new ArrayList<Pair>();

    public SuffixTrieData(int sentence, int index){
        addStartIndex(sentence, index);
    }   
    public class Pair{
        public int sentence;
        public int index;
        public Pair(int sentence, int index){
            this.sentence = sentence;
            this.index = index;
        }
    }
}

The error I get is:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.ArrayList.<init>(Unknown Source)
    at java.util.ArrayList.<init>(Unknown Source)
    at SuffixTrieData.<init>(SuffixTrieData.java:7)
    at SuffixTrie.insert(SuffixTrie.java:20)
    at SuffixTrie.insert(SuffixTrie.java:11)
    at SuffixTrie.readInFromFile(SuffixTrie.java:77)
    at SuffixTrie.main(SuffixTrie.java:89)

It works fine for smaller text files though and this is the first time they have given students this assignment, so the instructors don't know if this is doable with a suffix trie anyway..


Solution

  • A suffix trie is going to use a lot of space just for the words (letters) alone. In addition it appears you are storing an array of every sentence the word appears in with an index (the code you post is incomplete, correct me if I'm wrong). If the file is fairly large ... that's going to take up some space.

    One thing you could do is compress the sentences when storing, and decompress when retrieving them using deflate/inflate.

    Aside from that, you probably want to increase the heap size for the JVM when you run the process, using the -Xmx option (e.g. java -Xmx 2GB -jar myJarFile.jar).