How do sites like this store tens of thousands of words "containing c", or like this, "words with d and c", or even further, "unscrambling" the word like CAUDK
and finding that the database has duck
. Curious from an algorithms/efficiency perspective how they would accomplish this:
Would a database be used, or would the words simply be stored in memory and quickly traversed? If a database was used (and each word was a record), how would you make these sorts of queries (with PostgreSQL for example, contains
, starts_with
, ends_with
, and unscrambles
)?
I guess the easiest thing to do would be to store all words in memory (sorted?), and just traverse the whole million or less word list to find the matches? But how about the unscramble one?
Basically wondering the efficient way this would be done.
"Containing C
" amounts to count(C) > 0
. Unscrambling CAUDC
amounts to count(C) <= 2 && count(A) <= 1 && count(U) <= 1 && count(D) <= 1
. So both queries could be efficiently answered by a database with 26 indices, one for the count of each letter in the alphabet.
Here is a quick and dirty python sqlite3
demo:
from collections import defaultdict, Counter
import sqlite3
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
alphabet = [chr(ord('A')+i) for i in range(26)]
alphabet_set = set(alphabet)
columns = ['word TEXT'] + [f'{c}_count TINYINT DEFAULT 0' for c in alphabet]
create_cmd = f'CREATE TABLE abc ({", ".join(columns)})'
cur.execute(create_cmd)
for c in alphabet:
cur.execute(f'CREATE INDEX {c}_index ON abc ({c}_count)')
def insert(word):
counts = Counter(word)
columns = ['word'] + [f'{c}_count' for c in counts.keys()]
counts = [f'"{word}"'] + [f'{n}' for n in counts.values()]
var_str = f'({", ".join(columns)})'
val_str = f'({", ".join(counts)})'
insert_cmd = f'INSERT INTO abc {var_str} VALUES {val_str}'
cur.execute(insert_cmd)
def unscramble(text):
counts = {a:0 for a in alphabet}
for c in text:
counts[c] += 1
where_clauses = [f'{c}_count <= {n}' for (c, n) in counts.items()]
select_cmd = f'SELECT word FROM abc WHERE {" AND ".join(where_clauses)}'
cur.execute(select_cmd)
return list(sorted([tup[0] for tup in cur.fetchall()]))
print('Building sqlite table...')
with open('/usr/share/dict/words') as f:
word_set = set(line.strip().upper() for line in f)
for word in word_set:
if all(c in alphabet_set for c in word):
insert(word)
print('Table built!')
d = defaultdict(list)
for word in unscramble('CAUDK'):
d[len(word)].append(word)
print("unscramble('CAUDK'):")
for n in sorted(d):
print(' '.join(d[n]))
Output:
Building sqlite table...
Table built!
unscramble('CAUDK'):
A C D K U
AC AD AK AU CA CD CU DA DC KC UK
AUK CAD CUD
DUCK