Search code examples
javaalgorithmprojecttrie

Optimizing space usage in a trie structure in a java program


I am trying to implement a trie structure in Java with 203675 words for a text editor.

Before, I was using ArrayList to store the words and that was taking 90 megabytes of space. So I want to use a trie to minimize space consumption.

Here is what I have so far, but now space consumption is 250 megabytes. What is the reason for this increase?

package TextEditor;

import java.io.*;
import java.util.*;
import javax.swing.JOptionPane;

class Vertex {
    int words;
    Map<Character, Vertex> child;
    public Vertex() {
        words = 0;
        child = new HashMap<>();
    }
}
class Trie {
    private Vertex root;
    private InputStream openFile;
    private OutputStream openWriteFile;
    private BufferedReader readFile;
    private BufferedWriter writeFile;
    public Trie() {
        root = new Vertex();
    }
    public Trie(String path) {
         try {
            root = new Vertex();
            openFile = getClass().getResourceAsStream(path);
            readFile = new BufferedReader( new InputStreamReader(openFile));
            String in = readFile.readLine();
                    while(readFile.ready()) {
                        this.insert(in);
                    try {
                        in = readFile.readLine();
                    } catch (IOException ex) {
                        JOptionPane.showMessageDialog(null, 
                            "TRIE CONSTRUCTION ERROR!!!!");
                    }
                    }
        } catch (IOException ex) {
            JOptionPane.showMessageDialog(null, 
                "TRIE CONSTRUCTION ERROR!!!!");
        }
    }
    private void addWord(Vertex vertex, String s, int i) {
        try {
        if(i>=s.length()) {
            vertex.words += 1;
            return;
        }
        char ind  = s.charAt(i);
        if(!vertex.child.containsKey(ind)) {
            vertex.child.put(ind, new Vertex());
        }
    addWord(vertex.child.get(ind), s, i+1);
        } catch(Exception e) {
            e.printStackTrace();
            System.exit(1);
        }
    }
    final void insert(String s) {
        addWord(root, s.toLowerCase(), 0);
    }
    private void DFS(Vertex v, String s, ArrayList list, 
        boolean store, String startsWith, int ind) {
    if(v != null && v.words != 0) {
            if(!store) {
                System.out.println(s);
            }
            else {
                if(s.length() >= startsWith.length()) {
                    list.add(s);
                }
            }
        }
        for (Map.Entry<Character, Vertex> entry : v.child.entrySet()) {
            Character c = entry.getKey();
            if((startsWith ==  null) || (ind>=startsWith.length()) || 
                (startsWith.charAt(ind) == c)) {
                    DFS(v.child.get(c), s + c, list, store, startsWith, ind+1);
             }
        }
    }
    public void Print() {
        DFS(root, new  String(""), null, false, null, 0);
    }
    ArrayList<String> getAsList(String startsWith) {
        ArrayList ret = new ArrayList();
        DFS(root, new  String(""), ret, true, startsWith, 0);
        return ret;
    }
    int count(Vertex  vertex, String s, int i) {
    if(i >= s.length()) {
            return vertex.words;
        }
    if(!vertex.child.containsKey(s.charAt(i))) {
            return 0;
        }
        return count(vertex.child.get(s.charAt(i)), s, i+1);
    }
    int count(String s) {   
        return count(root, s, 0);
    }
}

Is there a working example of a trie structure I can use?


Solution

  • Your use of the word "space" is ambiguous. Based on your description, it sounds like you're talking about the heap. If so, the reason for the increased memory usage is that a data structure like a trie actually does take up extra memory to store its references between nodes. An ArrayList just packs everything in, one String reference after another, and it doesn't have any additional information beyond how long the array is. A trie has lots more bookkeeping to specify the relationships between all of the nodes.

    In particular, the HashMap in each vertex is going to be extremely expensive; the Sun implementation allocates enough space for a 16-entry map by default, and that requires storage for the map's own memory-allocation record, hashCodes (32-bit ints, not chars), the object wrappers for each Character...