Tags: python-2.7, list, jython-2.7

Python: word/frequency pairs contain multiple copies of each word in the list


So I have the following code:

def stripNonAlphaNum(text):
    import re
    return re.compile(r'\W+', re.UNICODE).split(text)

def readText(fileStub):
  words = open(fileStub, 'r').read()
  words = words.lower() # Make it lowercase
  wordlist = sorted(stripNonAlphaNum(words))
  wordfreq = []
  for w in wordlist: # Increase count of one upon every iteration of the word.
    wordfreq.append(wordlist.count(w))
  return list(zip(wordlist, wordfreq))

It reads a file in, and then pairs each word with the number of times it occurs. The issue I'm facing is that when I print the result, every word shows up once per occurrence instead of once overall.

If I have some input given, I might get output like this:

('and', 27), ('and', 27), ('and', 27), ('and', 27), ('and', 27), ('and', 27), ('and', 27),.. (27 times)

Which is NOT what I want it to do.

Rather, I would like each word to appear just once, together with its count, like so:

('and', 27), ('able', 5), ('bat', 6).. etc

So how do I fix this?


Solution

  • You should consider using a dictionary. Dictionaries are hash maps with unique keys, so assigning to a word that is already present overwrites its entry instead of adding a duplicate.

    ...
      wordfreq = {}
      for w in wordlist: 
        wordfreq[w] = wordlist.count(w)
      return wordfreq
    

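    If you really need to return a list, just do return wordfreq.items() (in Python 2.7 this is already a list of (word, count) tuples; in Python 3 you would wrap it in list()).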

    The only problem with this approach is that wordlist.count(w) is recomputed every time a word repeats, and each call scans the whole list. To call it only once per distinct word, write for w in set(wordlist):

    Edit for additional question: if you are OK with returning a list, just do return sorted(wordfreq.items(), key=lambda t: t[1]), which orders the pairs by count, lowest first. If you omit the key part, the result will be ordered by the word first, then the value.
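
Putting the pieces of this answer together, here is a minimal sketch that builds the counts in a single pass over the words, so list.count() is never called. The function name count_words and the sample sentence are made up for illustration; it also drops the empty strings that the \W+ split produces when the text starts or ends with punctuation, which the code in the question does not do. It should run unchanged on Python 2.7 and Python 3.

```python
import re

def count_words(text):
    """Return (word, count) pairs sorted by word, counting in one pass."""
    # Same split as stripNonAlphaNum in the question, plus a filter that
    # discards empty strings left at the start or end of the text.
    splitter = re.compile(r'\W+', re.UNICODE)
    words = [w for w in splitter.split(text.lower()) if w]

    wordfreq = {}
    for w in words:
        # dict.get supplies 0 the first time a word is seen.
        wordfreq[w] = wordfreq.get(w, 0) + 1

    # sorted() accepts the items of a dict on both Python 2.7 and 3.
    return sorted(wordfreq.items())

# Hypothetical sample input, not taken from the question.
print(count_words("The bat and the cat and the hat."))
# [('and', 2), ('bat', 1), ('cat', 1), ('hat', 1), ('the', 3)]
```

To order by count instead, pass key=lambda t: t[1] to sorted() as described above. The standard library's collections.Counter (available since Python 2.7) does the same counting if you would rather not write the loop by hand.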