Convert a list of tuples into a similarity matrix


I need to transform pairwise measurements, stored as a list of tuples, into a similarity matrix:

input tuples = [('a', 'b', 1), ('a', 'c', 2), ('a', 'd', 3), ('b', 'c', 4), ('b', 'd', 5), ('c', 'd', 6)]

and I want a list of lists (a similarity matrix) that I can use later for data analysis (e.g. a heatmap):

output list = [[0, 1, 2, 3], [1, 0, 4, 5], [2, 4, 0, 6], [3, 5, 6, 0]]

Here is the code I wrote, tested only on the example dataset above:

unique_values = ['a', 'b', 'c', 'd']

tuples = [('a', 'b', 1), ('a', 'c', 2), ('a', 'd', 3), ('b', 'c', 4),
          ('b', 'd', 5), ('c', 'd', 6)]

big_list = []
for value in unique_values:
    idx = unique_values.index(value)
    small_list = []
    small_list.insert(idx, 0)
    N = idx
    for tpl in tuples:

        if tpl[0] == value and tpl[1] == unique_values[1+N]:
            N += 1
            rmsd = tpl[2]
            small_list.insert(N, rmsd)

        elif (N % 2) == 0 and tpl[0] == value and tpl[1] == unique_values[N]:
            rmsd = tpl[2]
            small_list.insert(N, rmsd)

        elif (N % 2) != 0 and tpl[0] == value and tpl[1] == unique_values[idx+1]:
            rmsd = tpl[2]
            small_list.insert(idx+1, rmsd)

        elif (N % 2) != 0 and tpl[1] == value and tpl[0] == unique_values[abs(idx-1)]:
            rmsd = tpl[2]
            small_list.insert(idx-1, rmsd)

        elif (N % 2) != 0 and tpl[1] == value and tpl[0] == unique_values[abs(idx-N)]:
            rmsd = tpl[2]
            small_list.insert(idx-N, rmsd)
            N += 1

        elif (N % 2) == 0 and tpl[1] == value and tpl[0] == unique_values[abs(idx-N)]:
            rmsd = tpl[2]
            small_list.insert(idx-N, rmsd)
            N -= 1

    big_list.append(small_list)

print('big_list', big_list)

But this code works only on this small dataset. If I enlarge the dataset, e.g.:

unique_values = ['a', 'b', 'c', 'd', 'e']

tuples = [('a', 'b', 1), ('a', 'c', 2), ('a', 'd', 3), ('a', 'e', 4), ('b', 'c', 5), ('b', 'd', 6), ('b', 'e', 7), ('c', 'd', 8), ('c', 'e', 9), ('d', 'e', 10)]

the output is already wrong:

[[0, 1, 2, 3, 4], [1, 0, 5, 6, 7], [2, 5, 0, 8], [3, 6, 8, 0, 10], [4, 7, 0]]

I do not see how to write this algorithm for a dataset of any size.

Can someone help me fix this code, or is there already a Python package that does this?


Solution

  • This will do what you want:

    
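    def makeSimMatrix(unique, data):
        # Start from an n x n matrix of zeros; the diagonal stays 0.
        n = len(unique)
        array = [[0] * n for _ in range(n)]
        for row in data:
            # Map the two labels of this pair to their row/column positions.
            i1 = unique.index(row[0])
            i2 = unique.index(row[1])
            # Write the value on both sides of the diagonal (symmetric matrix).
            array[i1][i2] = row[2]
            array[i2][i1] = row[2]
        return array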
    
    unique1 = 'abcd'
    in1 = [('a', 'b', 1), ('a', 'c', 2), ('a', 'd', 3), ('b', 'c', 4), ('b', 'd', 5), ('c', 'd', 6)]
    print(makeSimMatrix(unique1,in1))
    
    unique2 = 'abcde'
    in2 = [('a', 'b', 1), ('a', 'c', 2), ('a', 'd', 3), ('a', 'e', 4), ('b', 'c', 5), ('b', 'd', 6), ('b', 'e', 7), ('c', 'd', 8), ('c', 'e', 9), ('d', 'e', 10)]
    print(makeSimMatrix(unique2,in2))
    

    Output:

    [[0, 1, 2, 3], [1, 0, 4, 5], [2, 4, 0, 6], [3, 5, 6, 0]]
    [[0, 1, 2, 3, 4], [1, 0, 5, 6, 7], [2, 5, 0, 8, 9], [3, 6, 8, 0, 10], [4, 7, 9, 10, 0]]
    

    If you don't already have the list of unique labels, you can derive it from the data with one extra pass.
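
The extra pass mentioned above could look something like the sketch below. It assumes every tuple has the form (label, label, value) and that labels should be ordered by first appearance in the data. The names `label_index` and `sim_matrix_from_pairs` are made up for this illustration and are not part of the answer's code. A dict is used for the label-to-index lookup so that each lookup does not rescan the label list the way `unique.index` does.

```python
def label_index(data):
    """Map each label to a row/column index, in first-seen order.

    Assumes each item in `data` is a (label_a, label_b, value) tuple.
    """
    index = {}
    for a, b, _ in data:
        # setdefault only assigns a new index the first time a label is seen.
        index.setdefault(a, len(index))
        index.setdefault(b, len(index))
    return index


def sim_matrix_from_pairs(data):
    """Return (labels, matrix) built only from the pair data."""
    index = label_index(data)
    n = len(index)
    matrix = [[0] * n for _ in range(n)]  # diagonal stays 0
    for a, b, value in data:
        i, j = index[a], index[b]
        matrix[i][j] = value
        matrix[j][i] = value  # mirror across the diagonal
    return list(index), matrix


pairs = [('a', 'b', 1), ('a', 'c', 2), ('a', 'd', 3),
         ('b', 'c', 4), ('b', 'd', 5), ('c', 'd', 6)]
labels, matrix = sim_matrix_from_pairs(pairs)
print(labels)  # ['a', 'b', 'c', 'd']
print(matrix)  # [[0, 1, 2, 3], [1, 0, 4, 5], [2, 4, 0, 6], [3, 5, 6, 0]]
```

Returning the labels together with the matrix keeps the row/column order explicit, which helps later when labelling the axes of a plot. Pairs that never occur in the data are left as 0.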