python-3.x, file, document-classification

Python code not correctly segregating documents into folders based on keywords


I am trying to sort documents into different folders based on whether certain keywords (keyword1 and keyword2) occur in each document's text. I am using regex for this.

Case 1: If keyword1 occurs, create a folder named keyword1 and store that document in it.
Case 2: If keyword2 occurs, create a folder named keyword2 and store that document in it.
Case 3: If neither keyword occurs, create an unknown folder and store those documents in it.

The logic works for the first two cases but not for the last one: when neither keyword appears, the documents still end up in the keyword2 folder.

Below is my Python implementation:

keyword = "keyword1"
for k, text_list in text_dict.items():
    file_name1 = k.split('.')[0]
    match = re.search(r"keyword1", text_list, flags = re.DOTALL|re.IGNORECASE)

    if match:
        print(f"The keyword '{keyword}' is present in the text.--->", k)
        os.makedirs('keyword1', exist_ok = True)
        shutil.copytree(os.path.join('imgs', file_name1), os.path.join('keyword1', file_name1), dirs_exist_ok=True)

    elif not match:
        print(f"The keyword '{keyword2}' is present in the text.--->", k)
        os.makedirs('keyword2', exist_ok = True)
        shutil.copytree(os.path.join('imgs', file_name1), os.path.join('keyword2', file_name1), dirs_exist_ok=True)

    else:
        print(f"The keywords '{keyword1}' and {keyword2} are not present in the text.--->", k)
        os.makedirs('unknown', exist_ok = True)
        shutil.copytree(os.path.join('imgs', file_name1), os.path.join('unknown', file_name1), dirs_exist_ok=True)

text_dict is a dictionary where the keys are the filenames and the values (text_list) are the text present in each file. The loop iterates through the dictionary and searches for the keywords in the values. If a keyword is found, i.e. match is truthy, it creates a folder of that name and stores the file in it.

The issue is in the last else: if neither keyword is found, an unknown folder should be created and those files stored in it. Instead, those files are being stored in the keyword2 folder.

Any help is appreciated!


Solution

  • Your current logic cannot work because, as pointed out in the comments, there are only two possible cases: match is either truthy or falsy. The if match and elif not match branches cover both, so the final else branch is never reached.
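
To make the unreachable branch concrete, here is a minimal sketch that mirrors the if / elif not / else chain from the question. The function name branch_taken and the sample strings are made up for illustration; only the branching structure comes from the question.

```python
import re

def branch_taken(text):
    """Report which branch of the question's if/elif/else chain runs."""
    match = re.search(r"keyword1", text, flags=re.IGNORECASE)
    if match:
        return "if"
    elif not match:
        return "elif"
    else:
        # Never reached: match is either a Match object (truthy) or None
        # (falsy), so one of the two branches above always runs first.
        return "else"

for sample in ["has keyword1", "has keyword2", "has neither"]:
    print(sample, "->", branch_taken(sample))
```

Any text without keyword1 lands in the elif branch, whether or not it contains keyword2, which matches the behaviour described in the question.
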

    What you can do is run two regex searches, one for keyword1 and one for keyword2, and then check which combination of results you got.

    Try this:

    import os
    import re
    import shutil

    keyword1 = "keyword1"
    keyword2 = "keyword2"
    # text_dict maps each file name to the text extracted from that file
    for k, text_list in text_dict.items():
        file_name1 = k.split('.')[0]
        match1 = re.search(r"keyword1", text_list, flags = re.DOTALL|re.IGNORECASE)
        match2 = re.search(r"keyword2", text_list, flags = re.DOTALL|re.IGNORECASE)

        if (match1) and (not match2):
            # Only keyword1 was found
            print(f"The keyword '{keyword1}' is present in the text.--->", k)
            os.makedirs('keyword1', exist_ok = True)
            shutil.copytree(os.path.join('imgs', file_name1), os.path.join('keyword1', file_name1), dirs_exist_ok=True)

        elif (not match1) and (match2):
            # Only keyword2 was found
            print(f"The keyword '{keyword2}' is present in the text.--->", k)
            os.makedirs('keyword2', exist_ok = True)
            shutil.copytree(os.path.join('imgs', file_name1), os.path.join('keyword2', file_name1), dirs_exist_ok=True)

        elif (not match1) and (not match2):
            # Neither keyword was found
            print(f"The keywords '{keyword1}' and '{keyword2}' are not present in the text.--->", k)
            os.makedirs('unknown', exist_ok = True)
            shutil.copytree(os.path.join('imgs', file_name1), os.path.join('unknown', file_name1), dirs_exist_ok=True)

        else:
            # Both keywords were found; the question does not say how to handle this
            print('whatever')
    

    Cheers!
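
If the repeated branches get unwieldy, one possible variation is to separate "which folder?" from "copy the files". The sketch below is only an illustration: the helper names classify and sort_documents are invented here, and it assumes that when both keywords occur the first one in the list wins, which the question does not specify. It keeps the imgs source folder and the k.split('.')[0] naming from the code above, and dirs_exist_ok needs Python 3.8 or later.

```python
import os
import re
import shutil

def classify(text, keywords=("keyword1", "keyword2"), default="unknown"):
    """Return the first keyword found in text (case-insensitive), else default.

    Assumption: if several keywords occur, the earliest one in `keywords` wins.
    """
    for kw in keywords:
        if re.search(re.escape(kw), text, flags=re.IGNORECASE):
            return kw
    return default

def sort_documents(text_dict, src_root="imgs"):
    """Copy each document's folder into the folder named by classify()."""
    for file_name, text in text_dict.items():
        stem = file_name.split(".")[0]
        target = classify(text)
        os.makedirs(target, exist_ok=True)
        shutil.copytree(os.path.join(src_root, stem),
                        os.path.join(target, stem),
                        dirs_exist_ok=True)

# Classification alone can be checked without touching the file system.
samples = {
    "a.pdf": "Contains Keyword1 here",
    "b.pdf": "only KEYWORD2",
    "c.pdf": "nothing relevant",
}
for name, text in samples.items():
    print(name, "->", classify(text))
```

Calling sort_documents(text_dict) would then do the copying. Adding a third keyword only means extending the keywords tuple rather than writing another elif.
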