python tkinter glob txt

Text Comparing Program


I'm writing a program that compares text files by listing every word that appears in each file, along with the number of times it appears. Words in a list called stopwords have to be ignored, so they are never counted. For each word, I need to check whether it is in the stopwords. If it is, I skip it. If it isn't, I add a new row for that word to a DataFrame (unless that row already exists) and increment its frequency by 1. Each text file gets its own column. I'm a little stuck on this part. I have bits of the code already, but I need to fill in the blanks. Here is what I have so far:

from tkinter.filedialog import askdirectory
import glob

import os 
import pandas as pd


def main():
    df = pd.DataFrame(columns =["TEXT FILE NAMES HERE..."])
    data_directory = askdirectory(initialdir = "/School_Files/CISC_121/Assignments/Assignment3/Data_Files")
    stopwords = open(os.getcwd() + "/" + "StopWords.txt") 



    text_files = glob.glob(data_directory + "/" + "*.txt")



    for f in text_files:
        infile = open(f, "r", encoding = "UTF-8")
        #now read the file and do all the word-counting etc...
        lines = infile.readlines()
        for line in lines:
            x = 0
            words = line.split()
            while (x < len(words)):
                """
                Check if the word is in the stopwords
                If it isn't, then add the word into a row in a dataframe, for the first occurence, then
                increment the value by 1
                Have a column for each book 
                """
                for line in infile:
                    if word in line:
                        found = True
                        word +=1 
                    else:
                        found = False

                x = x+1

main()

If anyone can help me finish this section I'd really appreciate it. Please show the change in code. Thanks in advance!


Solution

  • It looks like you just want to count how often each word occurs. For that, you could use a dictionary instead of a DataFrame.

    For the stopwords, read the file into a list.

    Try the code below.

    stopwords = []
    count_dictionary = {}
    
    # read one stopword per line into a list
    with open(os.getcwd() + "/" + "StopWords.txt") as f:
        stopwords = f.read().splitlines()
    
    #your code
    
    while (x < len(words)):
        word = words[x]
        # count the word only if it is not a stopword
        if word not in stopwords:
            if word in count_dictionary:
                count_dictionary[word] += 1
            else:
                count_dictionary[word] = 1
        x = x + 1
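
If you also need separate counts for each text file, as the question describes, the same idea can be extended to one dictionary per file. The following is only a minimal sketch under a few assumptions: the helper names `load_stopwords`, `count_words`, and `count_directory` are made up for illustration, words are split on whitespace only (no lowercasing or punctuation stripping), and the stopwords are kept in a set rather than a list so that membership checks stay fast.

```python
from collections import Counter
from pathlib import Path


def load_stopwords(path):
    """Read one stopword per line into a set (hypothetical helper)."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def count_words(path, stopwords):
    """Count whitespace-separated words in one file, skipping stopwords."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            for word in line.split():
                if word not in stopwords:
                    counts[word] += 1
    return counts


def count_directory(data_directory, stopwords):
    """Map each .txt file name in the directory to its own word counts."""
    return {
        p.name: count_words(p, stopwords)
        for p in sorted(Path(data_directory).glob("*.txt"))
    }
```

Calling `count_directory(data_directory, load_stopwords("StopWords.txt"))` returns a dictionary keyed by file name. If a DataFrame with one column per file is still required, passing that result to `pd.DataFrame(...)` and then calling `.fillna(0)` should give one row per word and one column per file.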