Tags: python, nlp, training-data, pos-tagger

Python code to get the most frequent word and tag pair from a list


I have a .txt file with 3 columns: word position, word and tag (NN, VB, JJ, etc.).

Example of txt file:

1   i   PRP

2   want    VBP

3   to  TO

4   go  VB

I want to count how often each word and tag occur together as a pair, so that I can find the tag most frequently assigned to each word. Example of results: 3 (food, NN), 2 (Brave, ADJ)

My idea is to open the file, read it line by line and split each line, keep a counter in a dictionary, and print the pairs in descending order from most common to least common.

My code is extremely rough (I'm almost embarrassed to post it):

file=open("/Users/Desktop/Folder1/trained.txt")
wordcount={}
for word in file.read().split():
    from collections import Counter
    c = Counter()
    for d in dicts.values():
        c += Counter(d)

print(c.most_common())

file.close()

Obviously, I'm getting no results. Anything will help. Thanks.

UPDATE:

So I got this code posted on here, which worked, but my results are kind of funky. Here's the code (the author removed it, so I don't know who to credit):

file=open("/Users/Desktop/Folder1/trained.txt").read().split('\n')

d = {}
for i in file:
    if i[1:] in d.keys():
        d[i[1:]] += 1
    else:
        d[i[1:]] = 1

print (sorted(d.items(), key=lambda x: x[1], reverse=True))

Here are my results:

[('', 15866), ('\t.\t.', 9479), ('\ti\tPRP', 7234), ('\tto\tTO', 4329), ('\tlike\tVB', 2533), ('\tabout\tIN', 2518), ('\tthe\tDT', 2389), ('\tfood\tNN', 2092), ('\ta\tDT', 2053), ('\tme\tPRP', 1870), ('\twant\tVBP', 1713), ('\twould\tMD', 1507), ('0\t.\t.', 1427), ('\teat\tVB', 1390), ('\trestaurant\tNN', 1371), ('\tuh\tUH', 1356), ('1\t.\t.', 1265), ('\ton\tIN', 1237), ("\t'd\tMD", 1221), ('\tyou\tPRP', 1145), ('\thave\tVB', 1127), ('\tis\tVBZ', 1098), ('\ttell\tVB', 1030), ('\tfor\tIN', 987), ('\tdollars\tNNS', 959), ('\tdo\tVBP', 956), ('\tgo\tVB', 931), ('2\t.\t.', 912), ('\trestaurants\tNNS', 899),

There seems to be a mix of good results with words and other results with empty strings or stray numbers. Does anyone know a way to remove the entries that aren't real words? Also, I know \t signifies a tab; is there a way to remove that as well? You guys really helped a lot.


Solution

  • You need a separate collections.Counter for each word. This code uses defaultdict to create a dictionary of counters, so there is no need to check whether a word has already been seen.

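    from collections import Counter, defaultdict

    counts = defaultdict(Counter)      # maps word -> Counter of its tags
    with open("/Users/Desktop/Folder1/trained.txt") as file:
        for row in file:               # read one line into `row`
            if not row.strip():
                continue               # ignore empty lines
            pos, word, tag = row.split()   # position, word, tag
            counts[word.lower()][tag] += 1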
    

    That's it. You can now check the most common tag of any word:

    print(counts["food"].most_common(1))
    # Prints something like [('NN', 3)]
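
If what you want is the overall ranking of (word, tag) pairs, as in the "3 (food, NN)" example from the question, a single Counter keyed by tuples may be enough. The following is only a sketch: the function name `count_pairs`, the `skip_punctuation` flag, and the sample lines are made up for illustration, and it assumes every useful line has exactly three whitespace-separated fields. Calling `split()` with no arguments splits on tabs as well as spaces, so the `\t` characters never reach the keys, and blank lines are skipped.

```python
from collections import Counter


def count_pairs(lines, skip_punctuation=True):
    """Count (word, tag) pairs in lines shaped like 'position word tag'."""
    pairs = Counter()
    for line in lines:
        fields = line.split()          # splits on tabs/spaces and drops the newline
        if len(fields) != 3:
            continue                   # skip blank or malformed lines
        _position, word, tag = fields
        if skip_punctuation and not any(ch.isalnum() for ch in word):
            continue                   # skip tokens such as "." with no letters or digits
        pairs[(word.lower(), tag)] += 1
    return pairs


# Made-up sample lines standing in for the real file
sample = ["1\ti\tPRP", "2\twant\tVBP", "3\tfood\tNN", "4\t.\t.", "", "1\tfood\tNN"]
pairs = count_pairs(sample)

# Print pairs from most to least common
for (word, tag), n in pairs.most_common():
    print(n, (word, tag))
```

To run it on the real file, pass the open file object instead of the sample list, for example `with open("/Users/Desktop/Folder1/trained.txt") as f: pairs = count_pairs(f)`. Whether punctuation tokens count as "real words" is a judgment call, so the filter can be switched off with `skip_punctuation=False`.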