How can I effectively pull out human readable strings/terms from code automatically?


I'm trying to determine the most common words, or "terms" (I think), as I iterate over many different files.

Example - For this line of code found in a file:

for w in sorted(strings, key=strings.get, reverse=True):

I'd want these unique strings/terms returned to my dictionary as keys:

for
w
in
sorted
strings
key
strings
get
reverse
True

However, I want this code to be tunable so that I can also return strings with periods or other characters between them, because I just don't know what makes sense yet until I've run the script and counted up the "terms" a few times:

strings.get

How can I approach this problem? It would help to understand how I can do this one line at a time, so I can apply it in a loop as I read my file's lines in. I've got the basic logic down, but I'm currently tallying by unique line instead of by "term":

strings = dict()
fname = '/tmp/bigfile.txt'

with open(fname, "r") as f:
    for line in f:
        if line in strings:
            strings[line] += 1
        else:
            strings[line] = 1

for w in sorted(strings, key=strings.get, reverse=True):
    print(w.rstrip() + " : " + str(strings[w]))

(Yes, I used code from my little snippet here as the example at the top.)


Solution

  • If the only Python token you want to keep together is the object.attr construct, then all the tokens you are interested in would fit the regular expression

    \w+\.?\w*
    

    This basically means "one or more word characters (letters, digits, or _), optionally followed by a . and zero or more further word characters".

    Note that this would also match number literals like 42 or 7.6, but those would be easy enough to filter out afterwards, as shown below.
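
    As a quick sanity check, here is a small demo applying the pattern to the example line from the question; the isdigit() filter at the end is just one possible way to drop the number literals mentioned above:

    import re

    pattern = re.compile(r"\w+\.?\w*")
    line = "for w in sorted(strings, key=strings.get, reverse=True):"

    # With no groups in the pattern, findall returns the matched substrings directly
    tokens = pattern.findall(line)
    print(tokens)
    # ['for', 'w', 'in', 'sorted', 'strings', 'key', 'strings.get', 'reverse', 'True']

    # Python identifiers can't start with a digit, so dropping tokens whose first
    # character is a digit filters out literals like 42 or 7.6
    words = [t for t in tokens if not t[0].isdigit()]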

    Then you can use collections.Counter to do the actual counting for you:

    import collections
    import re
    
    pattern = re.compile(r"\w+\.?\w*")
    
    # Here I'm using the source file for `collections` itself as the test example
    with open(collections.__file__, "r") as f:
        tokens = collections.Counter(t.group() for t in pattern.finditer(f.read()))
        for token, count in tokens.most_common(5):  # show only the top 5
            print(token, count)
    

    Running Python version 3.6.0a1, the output is this:

    self 226
    def 173
    return 170
    self.data 129
    if 102
    

    This makes sense for the collections module, since it is full of classes that use self and define methods. It also shows that the pattern does capture self.data, which fits the construct you are interested in.
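
    Finally, since you wanted this to be tunable and to work one line at a time, here is a minimal sketch of one way to parameterize which "joiner" characters keep terms together. count_terms is a hypothetical helper name, it assumes joiners is non-empty, and the pattern it builds is a slight variant of the one above: it requires a word character after each joiner, so it won't match a trailing . and it keeps whole chains like a.b.c together.

    import collections
    import re

    def count_terms(path, joiners="."):
        """Count terms in a file, treating any character in joiners as glue
        that keeps adjacent word characters together (assumes joiners != "")."""
        # joiners="."  builds \w+(?:[\.]\w+)*  -> keeps strings.get together
        # joiners=".-" would also keep tokens like foo-bar together
        pattern = re.compile(r"\w+(?:[{}]\w+)*".format(re.escape(joiners)))
        counts = collections.Counter()
        with open(path, "r") as f:
            for line in f:  # one line at a time, as in the question
                counts.update(pattern.findall(line))
        return counts

    for term, count in count_terms("/tmp/bigfile.txt").most_common(5):
        print(term, count)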