
Word & Line Concordance Program


I originally posted this question here but was then told to post it to Code Review; however, they told me it needed to be posted here instead. I will try to explain my problem better so there is no confusion. I am trying to write a word-concordance program that will do the following:

1) Read the stop_words.txt file into a dictionary (use the same type of dictionary that you’re timing) containing only stop words, called stopWordDict. (WARNING: Strip the newline(‘\n’) character from the end of the stop word before adding it to stopWordDict)

2) Process the WarAndPeace.txt file one line at a time to build the word-concordance dictionary(called wordConcordanceDict) containing “main” words for the keys with a list of their associated line numbers as their values.

3) Traverse the wordConcordanceDict alphabetically by key to generate a text file containing the concordance words printed out in alphabetical order along with their corresponding line numbers.

I tested my program on a small file with a short list of stop words and it worked correctly (an example is provided below). The outcome was what I expected: a list of the main words with their line numbers, excluding words from the stop_words_small.txt file. The only difference between the small file I tested and the main file I am actually trying to process is that the main file is much longer and contains punctuation. The problem I am running into is that when I run my program on the main file, I get far more results than expected, because the punctuation is not being removed from the file.

For example, below is a section of the output where my code counted the word Dmitri as four separate words because of differences in capitalization and the punctuation that follows the word. If my code removed the punctuation correctly, Dmitri would be counted as one word followed by all the locations where it was found. My output also separates upper- and lower-case words, so my code is not lower-casing the file either.

What my code currently displays:

Dmitri : [2528, 3674, 3687, 3694, 4641, 41131]

Dmitri! : [16671, 16672]

Dmitri, : [2530, 3676, 3685, 13160, 16247]

dmitri : [2000]

What my code should display:

dmitri : [2000, 2528, 2530, 3674, 3676, 3685, 3687, 3694, 4641, 13160, 16671, 16672, 41131]

Words are defined to be sequences of letters delimited by any non-letter. There should also be no distinction made between upper and lower case letters, but my program splits those up as well. Blank lines, however, are still counted in the line numbering.
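One way to satisfy that definition directly (a sketch, not code from the original post) is to extract maximal runs of letters with `re.findall` and lowercase them, instead of splitting on an enumerated set of delimiters:

```python
import re

def words_of(line):
    # Words are maximal runs of letters; every non-letter is a delimiter.
    # Lowercasing removes the upper/lower case distinction.
    return [w.lower() for w in re.findall(r"[a-zA-Z]+", line)]

print(words_of('Dmitri! said, "Dmitri, dmitri."'))
# ['dmitri', 'said', 'dmitri', 'dmitri']
```

With this approach all four punctuated/capitalized spellings of Dmitri collapse to the single key `dmitri`.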

Below is my code. I would appreciate it if anyone could take a look and give me feedback on what I am doing wrong. Thank you in advance.

import re

def main():
    stopFile = open("stop_words.txt","r")
    stopWordDict = dict()

    for line in stopFile:
        stopWordDict[line.lower().strip("\n")] = []

    hwFile = open("WarAndPeace.txt","r")
    wordConcordanceDict = dict()
    lineNum = 1

    for line in hwFile:
        wordList = re.split(" |\n|\.|\"|\)|\(", line)
        for word in wordList:
            word.strip(' ')
            if (len(word) != 0) and word.lower() not in stopWordDict:
                if word in wordConcordanceDict:
                    wordConcordanceDict[word].append(lineNum)
                else:
                    wordConcordanceDict[word] = [lineNum]
        lineNum = lineNum + 1

    for word in sorted(wordConcordanceDict):
        print (word," : ",wordConcordanceDict[word])


if __name__ == "__main__":
    main()
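A minimal reproduction (a sketch) of the behavior I am seeing: the split pattern never lists `,` or `!`, so those characters stay attached to the word, and `str.strip` returns a new string rather than changing `word` in place, so its result is discarded:

```python
import re

line = 'Dmitri, dmitri! (Dmitri)\n'

# The pattern covers spaces, newlines, periods, quotes and parentheses,
# but not commas or exclamation marks, so they survive the split:
print(re.split(" |\n|\.|\"|\)|\(", line))

# Strings are immutable: strip() returns a new string, and calling it
# without assigning the result (as in `word.strip(' ')`) changes nothing.
word = ' Dmitri, '
word.strip(' ')
print(repr(word))  # still ' Dmitri, '
```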

Just as another example and for reference, here are the small file and the short list of stop words I tested with, which worked perfectly.

stop_words_small.txt file

a, about, be, by, can, do, i, in, is, it, of, on, the, this, to, was

small_file.txt

This is a sample data (text) file to
be processed by your word-concordance program.

The real data file is much bigger.

correct output

bigger: 4

concordance: 2

data: 1 4

file: 1 4

much: 4

processed: 2

program: 2

real: 4

sample: 1

text: 1

word: 2

your: 2

Solution

  • You can do it like this:

    import re
    from collections import defaultdict
    
    wordConcordanceDict = defaultdict(list)
    
    with open('stop_words_small.txt') as sw:
        words = (line.strip() for line in sw)
        stop_words = set(words)
    
    with open('small_file.txt') as f:
        for line_number, line in enumerate(f, 1):
            words = (re.sub(r'[^\w\s]','',word).lower() for word in line.split())
            good_words = (word for word in words if word not in stop_words)
            for word in good_words:
                wordConcordanceDict[word].append(line_number)
    
    for word in sorted(wordConcordanceDict):
        print('{}: {}'.format(word, ' '.join(map(str, wordConcordanceDict[word]))))
    

    Output:

    bigger: 4
    data: 1 4
    file: 1 4
    much: 4
    processed: 2
    program: 2
    real: 4
    sample: 1
    text: 1
    wordconcordance: 2
    your: 2
    

I will add explanations tomorrow; it's getting late here ;). Meanwhile, you can ask in the comments if some part of the code isn't clear to you.
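In the meantime, here is the heart of the solution in isolation (a sketch): `re.sub(r'[^\w\s]', '', word)` deletes every character that is neither a word character nor whitespace, and `.lower()` then collapses the case variants, so all the punctuated spellings of Dmitri map to one key:

```python
import re

# The cleaning step used in the solution: strip punctuation, then lowercase.
for word in ['Dmitri!', 'Dmitri,', 'dmitri', '(text)']:
    print(re.sub(r'[^\w\s]', '', word).lower())
# dmitri
# dmitri
# dmitri
# text
```

Note that removing the hyphen this way merges `word-concordance` into `wordconcordance`, which is why the output above shows `wordconcordance: 2` where the expected output listed `word` and `concordance` separately.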