Search code examples
pythonlistcomparisonfuzzywuzzy

More efficient string comparison in list


I'm writing a program to detect court cases cited in a large number of texts from different sources, and count how many times each is cited across texts. The problem stems from the fact that cases exist in two states within most documents: They are referenced in full once or twice in a given text (ie "Timothy Ivory Carpenter v. United States of America") and then are referred to in a shorter form throughout the rest of the document (ie "Carpenter v. U.S."). Currently I am appending the titles to a massive list and then running the list thru the FuzzyWuzzy string similarity detection package.

Right now I am comparing each variable against every other variable each time, which is INSANELY inefficient. My question is: Is there a way to only run the comparison's that haven't been executed? I know I could cut the list in half and make it slightly more efficient, but I'd still be comparing 1/2 the list against itself. My other thought was to create a list of which had been compared and cross reference each pair and it's mirrored version (ie both "1:5" and "5:1"), in terms of processing time, that ends up being 34% slower than just brute forcing it.

if variable_list = ['1','2','3','4','5']

What I'm doing now is comparing each of the following:

1:1 1:2 1:3 1:4 1:5
2:1 2:2 2:3 2:4 2:5
3:1 3:2 3:3 3:4 3:5
4:1 4:2 4:3 4:4 4:5
5:1 5:2 5:3 5:4 5:5

which is terrible, I know

Is there a way to get python to run this instead?

One part of the problem is that you can't just check if the comparison has been run because the overlapped checks are mirrored, not identical (ie the redundancy of 3:1 is that 1:3 has been compared already, not 3:1)

I don't neeed the self reference, but I included it since I assume this would be based on a checking if the comparison has been done.

1:1 1:2 1:3 1:4 1:5
2:2 2:3 2:4 2:5
3:3 3:4 3:5
4:4 4:5
5:5

Code:

for var_1 in variable_list:

    for var_2 in variable_list:

    ### The chunk below sets the parameters to filter the strings

        if fuzz.token_set_ratio(var_2_reg_filt, var_1_reg_filt) > 91:

            if fuzz.ratio(var_2_reg_filt, var_1_reg_filt) > 87:

                if fuzz.partial_ratio(var_2_reg_filt, var_1_reg_filt) > 87:

                    if fuzz.token_sort_ratio(var_2_reg_filt, var_1_reg_filt) > 83:

                        ### This code then removes the longer of the two strings
                        ### -- and replaces it with the shorter version

                        if (len(var_1_reg_filt)) > (len(var_2_reg_filt)):

                            <<< Code to Replace var_1 with var_2 >>>

                        if (len(var_1_reg_filt)) < (len(var_2_reg_filt)):

                            <<< Code to Replace var_1 with var_2 >>>

This isn't an "error" code problem, but a conceptual one instead. I included the code to show what I was doing: Running each iteration of var_1 : var_2 through three different filters to cull close but incorrect matches.


Solution

  • I assume you want combinations of your variables see the combinations iterator in itertools package

    import itertools
    vars = ['1','2','3','4','5']
    list(itertools.combinations(vars,2))
    >>>
    [('1', '2'),
     ('1', '3'),
     ('1', '4'),
     ('1', '5'),
     ('2', '3'),
     ('2', '4'),
     ('2', '5'),
     ('3', '4'),
     ('3', '5'),
     ('4', '5')]
    

    if you need to compare them to themselves you can use combinations_with_replacement

    so in the end your loop should look like

    for var_1, var_2 in itertools.combinations(variable_list,2):
        ....