This is a strange one but I have found the Stackoverflow community to be very helpful. I have mySQL Table with a column full of parsed text data. I want to analyze the data and see in how many rows words appear.
ID columnName
1 Car
2 Dog
3 CAR CAR car CAR
From the above example what I want returned is that the word CAR appears in two rows and the word Dog Appears in 1 row. I don't really care how much the word count is as much as in how many rows does the word appear in. The problem is that I don't know which words to search for. Is there a tool, or something I can build in python, that would show me the most popular words used and in how many rows do the words appear in. I have not idea where to start and it would be great if someone could assist me with this.
I'd use python:
1) setup python to work with mysql (loads of tutorials online)
2) define:
from collections import defaultdict
tokenDict = defaultdict(lambda: 0)
the former is a simple dictionary which returns 0 if there is no value with the given key (i.e. tokenDict['i_have_never_used_this_key_before'] will return 0)
3) read each row from the table, tokenize it and increment the token counts
tokens = row.split(' ') //tokenize
tokens = [lower(t) for t in tokens] //lowercase
tokens = set(tokens) //remove duplicates
for token in tokens:
tokenDict[token] = tokenDict[token] + 1