Search code examples
androidperformancedata-structurestreeset

Android dictionary TreeSet faster load time


I have like 300000 words in my dictionary (actually saved in txt format (new line delimited) on sdcard of my Android device). I want to build data structure that would take as less time as possible to insert words (String-s) from my txt file in this data structure. And this DS must be super fast for checking if words exists in dictionary (this DS) or not. I have tried several build-in DS and the fastest IMO was TreeSet. Is there any other (non-build-in) DS that would be faster in inserting / creating DS and as equal as TreeSet for searching?

And one more thing is there any way I can "help" TreeSet to be faster in insert by rearranging my txt file (put words in proper order).

Regards


Solution

  • Firstly, well done on experimenting to find the best structure for your application. Often people will argue without trying out various options to get real performance data.

    If you want to save build time, and your words file does not change very often, the obvious build speed improvement is caching the data structure. Whatever data structure you are using, build the structure once, and then store the structure to the SD-card (rather than just storing the strings). Standard java.util structures can be stored using Serialization.

    If you want fastest build time, and your word list is sorted in alphabetical order, or can be, then you could just store in a String array. Build time will be very quick again, and search time will be similar to a TreeSet (using Arrays.binarySearch()).

    If you want faster lookup, you might want to check out Perfect Hashing or Tries, but these aren't in the Java standard libraries.

    A trie will be much more memory efficient than either of these, which may make it quicker. (Information on finding an implementation)

    I'm surprised TreeSet is faster than HashSet in your experiments, which means that you might be operating in a situation where memory allocation is expensive. Did you remember to set the initial capacity when you allocated the HashSet? Remember to avoid an expensive rehash, you need to set the initial capacity to at least number of items/0.75 (the load factor).