Python - Fuzzy matching result in new column for category based on ratio over 80


I would like to scan a folder, pick up all the files ending in '.txt', and then create a data frame with a new column that groups similar file names into categories (a fuzzy-match ratio of 80 or more).

import os
path = '../../../files'
text_files = [f for f in os.listdir(path) if f.endswith('.txt')]
text_files 

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

s1 = "programmi.txt"
s2 = "programmi-2.txt"
fuzz.ratio(s1, s2)

I expect a result like the one below:

[image not preserved]


Solution

  • Here's a solution that uses two nested for loops to compare each file name with all the others, and uses the resulting fuzz.ratio score to assign the categories.

    import pandas as pd
    from fuzzywuzzy import fuzz  # only fuzz.ratio is used below
    
    txt_list = [
        "programmi.txt",
        "readl-001.txt",
        "dict_class124.txt",
        "readl-002.txt",
        "programmi-2.txt",
        "programmi-re.txt",
        "readl-003.txt",
        "dict_class125.txt",
        "dict_class1264.txt",
        "hello world",  # matches nothing else, so it ends up in its own category
    ]
    
    list_categorised_texts = []
    txt_category = []
    category_index = 0
    threshold = 80
    
    # two nested for loops, since we need to compare each text to all the others
    for txt_1 in txt_list:
    
        if txt_1 not in list_categorised_texts:  # txt_1 is not yet categorised, so it starts a new category
            category_index += 1
            list_categorised_texts.append(txt_1)
            txt_category.append(category_index)
    
        # category of txt_1 itself; if txt_1 was matched earlier, this is
        # an older category than the current category_index
        current_category = txt_category[list_categorised_texts.index(txt_1)]
    
        for txt_2 in txt_list:
    
            if txt_1 == txt_2:  # we don't want to compare a text with itself
                continue
    
            elif txt_2 in list_categorised_texts:  # skip already categorised texts
                continue
    
            else:  # if txt_2 is similar enough, add it to the category of txt_1
                similarity = fuzz.ratio(txt_1, txt_2)
                if similarity >= threshold:
                    list_categorised_texts.append(txt_2)
                    txt_category.append(current_category)
    
    data = {
        'texts': list_categorised_texts,
        'category': txt_category
    }
    
    df = pd.DataFrame(data)
    print(df.to_markdown())  # to_markdown() needs the optional 'tabulate' package
    

    Result:

    |    | texts              |   category |
    |---:|:-------------------|-----------:|
    |  0 | programmi.txt      |          1 |
    |  1 | programmi-2.txt    |          1 |
    |  2 | programmi-re.txt   |          1 |
    |  3 | readl-001.txt      |          2 |
    |  4 | readl-002.txt      |          2 |
    |  5 | readl-003.txt      |          2 |
    |  6 | dict_class124.txt  |          3 |
    |  7 | dict_class125.txt  |          3 |
    |  8 | dict_class1264.txt |          3 |
    |  9 | hello world        |          4 |
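
For the folder-based setup in the question, the same grouping can be wrapped in a function. The following is a minimal sketch, not the answerer's code: `ratio`, `list_text_files`, and `categorise` are hypothetical helper names. `ratio` uses the standard library's `difflib.SequenceMatcher` as a stand-in so the sketch runs without extra packages; passing `fuzz.ratio` as `scorer` should work the same way. Each new match is given the category of the name it matched. File names are sorted because `os.listdir` returns them in arbitrary order and the grouping depends on that order.

```python
import os
from difflib import SequenceMatcher


def ratio(a, b):
    """Stand-in for fuzz.ratio: similarity score from 0 to 100 (assumption)."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def list_text_files(path):
    """Return the '.txt' file names in `path`, sorted for a stable order."""
    return sorted(f for f in os.listdir(path) if f.endswith(".txt"))


def categorise(names, threshold=80, scorer=ratio):
    """Greedy grouping: return (name, category) pairs in order of assignment."""
    categories = {}  # name -> category number
    next_category = 0
    for first in names:
        if first not in categories:
            # first has not been matched yet, so it starts a new category
            next_category += 1
            categories[first] = next_category
        for second in names:
            # first itself is already in `categories`, so it is skipped here
            if second not in categories and scorer(first, second) >= threshold:
                categories[second] = categories[first]
    return list(categories.items())


if __name__ == "__main__":
    example = ["programmi.txt", "readl-001.txt", "programmi-2.txt", "hello world"]
    for name, category in categorise(example):
        print(category, name)
```

With pandas installed, the pairs can be turned into a data frame with `pd.DataFrame(categorise(list_text_files(path)), columns=['texts', 'category'])`.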
    

    Warning:

    Please note that this approach is order-dependent: the categories you get depend on which name is compared first. In the example below, comparing dict_cl.txt to the other names yields only one match, while comparing dict_class12.txt to the other names yields three. For your use case, where we assume that the groups are clearly distinct from each other, this should not be a problem. However, the example shows that pairwise comparisons get tricky in less clear-cut situations.

    print(fuzz.ratio('dict_cl.txt', 'dict_class125.txt'))  # 79 -> not same category
    print(fuzz.ratio('dict_cl.txt', 'dict_class1264.txt'))  # 76 -> not same category
    print(fuzz.ratio('dict_cl.txt', 'dict_class12.txt'))  # 81 -> same category
    print("###")
    print(fuzz.ratio('dict_class12.txt', 'dict_cl.txt'))  # 81 -> same category
    print(fuzz.ratio('dict_class12.txt', 'dict_class125.txt'))  # 97 -> same category
    print(fuzz.ratio('dict_class12.txt', 'dict_class1264.txt'))  # 94 -> same category
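
One way to sidestep the order-dependency is to treat every pair scoring at or above the threshold as a link and put all linked names in the same group (connected components). The sketch below does that with a small union-find. It is an alternative technique, not part of the answer above, and `ratio` and `group_connected` are hypothetical names; `ratio` again uses `difflib.SequenceMatcher` as a stand-in for `fuzz.ratio`.

```python
from difflib import SequenceMatcher
from itertools import combinations


def ratio(a, b):
    """Stand-in for fuzz.ratio: similarity score from 0 to 100 (assumption)."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def group_connected(names, threshold=80, scorer=ratio):
    """Map each name to a group number; names linked by any chain of
    pairs scoring >= threshold share a group."""
    names = list(dict.fromkeys(names))  # drop duplicates, keep order
    parent = {name: name for name in names}

    def find(name):
        # follow parent links to the group's representative, compressing the path
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    # link every sufficiently similar pair, whatever the input order
    for a, b in combinations(names, 2):
        if scorer(a, b) >= threshold:
            parent[find(b)] = find(a)

    # number the groups by the first appearance of one of their members
    labels = {}
    result = {}
    for name in names:
        root = find(name)
        labels.setdefault(root, len(labels) + 1)
        result[name] = labels[root]
    return result


if __name__ == "__main__":
    names = ["dict_cl.txt", "dict_class125.txt", "dict_class1264.txt", "dict_class12.txt"]
    print(group_connected(names))
```

For the four dict_* names from the warning, dict_cl.txt links to dict_class12.txt, which links to the other two, so all four land in one group in any input order. The trade-off is chaining: two names that are not similar to each other can still end up together if a sequence of similar names connects them.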