mysql python-2.7 data-analysis

Analyze MySQL Text Data


This is a strange one, but I have found the Stack Overflow community to be very helpful. I have a MySQL table with a column full of parsed text data. I want to analyze the data and see how many rows each word appears in.

ID     columnName
1      Car
2      Dog
3      CAR CAR car CAR

From the above example, what I want returned is that the word CAR appears in two rows and the word Dog appears in one row. I don't care about the total word count so much as how many rows each word appears in. The problem is that I don't know which words to search for. Is there a tool, or something I can build in Python, that would show me the most popular words and how many rows each one appears in? I have no idea where to start, and it would be great if someone could assist me with this.


Solution

  • I'd use Python:

    1) Set up Python to work with MySQL (there are loads of tutorials online).

    2) Define a dictionary for the counts:

    from collections import defaultdict
    tokenDict = defaultdict(lambda: 0)
    

    tokenDict is a simple dictionary that returns 0 when there is no value for the given key (i.e. tokenDict['i_have_never_used_this_key_before'] will return 0).

    3) Read each row from the table, tokenize it, and increment the token counts. Here row is the text of one row:

    tokens = row.split()                  # tokenize on whitespace
    tokens = [t.lower() for t in tokens]  # lowercase so "CAR" and "car" match
    tokens = set(tokens)                  # remove duplicates within the row
    for token in tokens:
        tokenDict[token] += 1             # counts rows, not total occurrences
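
Putting the three steps together, a minimal sketch might look like the following. It swaps in `collections.Counter` for the `defaultdict`, since `Counter` also offers `most_common()` for ranking the words. The function name `rows_per_word` and the `sample_rows` list are made up for illustration; the list stands in for the column values from the question, so no database connection is needed to try it.

```python
from collections import Counter


def rows_per_word(rows):
    """Map each lowercased word to the number of rows that contain it.

    `rows` is any iterable of strings (hypothetical helper, not part of
    any library).
    """
    counts = Counter()
    for text in rows:
        # set() collapses repeats, so each row adds at most 1 per word
        counts.update(set(text.lower().split()))
    return counts


# Sample data standing in for the table column from the question
sample_rows = ["Car", "Dog", "CAR CAR car CAR"]
counts = rows_per_word(sample_rows)

# Most frequent words first
for word, n in counts.most_common():
    print("%s appears in %d row(s)" % (word, n))
```

With a real table, the rows would come from whichever MySQL driver you set up in step 1. Assuming a standard DB-API cursor, something like `rows_per_word(r[0] for r in cursor.fetchall())` after executing a `SELECT` on the text column should work. Note that splitting on whitespace leaves punctuation attached to words, so "car," and "car" would be counted separately unless you strip it first.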