Search code examples
dictionarymachine-learninglookuptrie

How to fit a large words dictionary in a small space with least effect on accuracy?


I am trying to implement a word play game using a microcontroller which only allows 30kb of data. And for that I need to lookup the words from a specific dictionary of allowed words which is almost 4 MB in size when uncompressed.

I don't need to give the correct answer every time so I can compromise on the accuracy. Is there a way to fit a 4MB dictionary in a 30kb space with minimum loss of accuracy?

I have already tried using an optimized 'trie' data structure as suggested here, using the compressed trie generator here which brought down the size from 4 MB to 740 KB, but I can't figure out a way to make it smaller without throwing away a significant chunk of the words.

The 'trie' would always give me the correct answer. Is there a way to reduce the size by trading off with accuracy and work out a structure which may give me the right answer most of the times? Maybe I could use a machine learning model or something related to it?

I understand that it's almost next to impossible. But the game is devised such that you don't need the accurate answer. Even an accuracy of ~25% is still reasonable.

I may leave out the longest words till the dictionary fits into that size. But that might not be the best approach in this case.


Solution

  • Unfortunately, I have to agree with the consensus emerging here. I've written some similar software (a Scrabble bot), so I've referred to my code and played around to make some calculations. I use the SOWPODS dictionary, which is actually quite a bit smaller than what you're describing - 267,751 words, which uncompressed occupies 2707014 bytes.

    Using a trie data structure is crucial for implementing AI for playing a game like Scrabble, not just because it reduces the size of the dictionary in memory, but because the basic structure dramatically decreases the computational complexity of the search functionality. As you try out possible permutations, you can immediately stop as soon as you hit a leaf in the trie. I bring this up because if you're trying to use an Arduino for this, you will inevitably also need to ensure that the code is very efficient in terms of speed.

    But in order to use the trie to ensure reasonable performance, this also means you'll need to establish linkages between nodes, and with a simple implementation on a 32-bit architecture, those links will occupy 4 bytes each. You could probably implement fancier logic to reduce the nodes to storing offsets with 2 bytes each (2^15 to point to the offset into your memory and the extra bit as an indication whether that node represents a word). But even then, that means you'd need the trie to have 15K nodes (actually less since it stands to reason you'll need some code in there too. :)

    I toyed around with limiting the maximum size of the words to see what's necessary to bring the number of nodes down far enough... Bad news, you could only store words up to 4 characters in length! Here's the number of nodes per maximum size:

    15: 589315
    14: 572754
    13: 546969
    12: 508959
    11: 456252
    10: 387321
    9: 304186
    8: 212237
    7: 126700
    6: 63605
    5: 25776
    4: 8208
    

    So basically, by the time you've reduce the size of the dictionary enough, it's no longer valuable to even use more sophisticated algorithms. There's just not enough memory to make it work.

    In response to the idea of using machine learning models, my experience has been that building a functioning model that can achieve even somewhat reasonable accuracy generally requires a considerable amount of memory, and getting reasonable performance requires moderately powerful hardware, even when only performing prediction. (Training is incredibly expensive, but you could do that offline.)

    Even reading the database from a disk may be a non-starter depending on the efficiency required. Caching can only get you so far.

    Honestly, I think @TypeKatz suggestion was the most reasonable. The Arduino simply isn't designed for this kind of application, so the best thing would be to offload the computationally expensive, memory-intensive processing to an external device. You might use an attached device over the serial port, or perhaps invest in a Wifi sheild and communicate with a server located nearby.

    Anyway, best of luck!