I need to transform pairwise measurements, stored as a list of tuples, into a similarity matrix:
input tuples = [('a', 'b', 1), ('a', 'c', 2), ('a', 'd', 3), ('b', 'c', 4), ('b', 'd', 5), ('c', 'd', 6)]
and I want a list of lists (a similarity matrix) that I can use later for data analysis (e.g. a heatmap):
output list = [[0, 1, 2, 3], [1, 0, 4, 5], [2, 4, 0, 6], [3, 5, 6, 0]]
Here is the code I wrote to test this on the example dataset above:
unique_values = ['a', 'b', 'c', 'd']
tuples = [('a', 'b', 1), ('a', 'c', 2), ('a', 'd', 3), ('b', 'c', 4),
          ('b', 'd', 5), ('c', 'd', 6)]
big_list = []
for value in unique_values:
    idx = unique_values.index(value)
    small_list = []
    small_list.insert(idx, 0)
    N = idx
    for tpl in tuples:
        if tpl[0] == value and tpl[1] == unique_values[1+N]:
            N += 1
            rmsd = tpl[2]
            small_list.insert(N, rmsd)
        elif (N % 2) == 0 and tpl[0] == value and tpl[1] == unique_values[N]:
            rmsd = tpl[2]
            small_list.insert(N, rmsd)
        elif (N % 2) != 0 and tpl[0] == value and tpl[1] == unique_values[idx+1]:
            rmsd = tpl[2]
            small_list.insert(idx+1, rmsd)
        elif (N % 2) != 0 and tpl[1] == value and tpl[0] == unique_values[abs(idx-1)]:
            rmsd = tpl[2]
            small_list.insert(idx-1, rmsd)
        elif (N % 2) != 0 and tpl[1] == value and tpl[0] == unique_values[abs(idx-N)]:
            rmsd = tpl[2]
            small_list.insert(idx-N, rmsd)
            N += 1
        elif (N % 2) == 0 and tpl[1] == value and tpl[0] == unique_values[abs(idx-N)]:
            rmsd = tpl[2]
            small_list.insert(idx-N, rmsd)
            N -= 1
    big_list.append(small_list)
print('big_list', big_list)
But this code works only on this small dataset. If I enlarge it, e.g.:
unique_values = ['a', 'b', 'c', 'd', 'e']
tuples = [('a', 'b', 1), ('a', 'c', 2), ('a', 'd', 3), ('a', 'e', 4), ('b', 'c', 5), ('b', 'd', 6), ('b', 'e', 7), ('c', 'd', 8), ('c', 'e', 9), ('d', 'e', 10)]
it already gives the wrong output:
[[0, 1, 2, 3, 4], [1, 0, 5, 6, 7], [2, 5, 0, 8], [3, 6, 8, 0, 10], [4, 7, 0]]
I do not see how to write this algorithm for a dataset of any size.
Could someone help me fix this code, or is there already a Python package that does this?
This will do what you want:
def makeSimMatrix( unique, data ):
    # Start with an N x N matrix of zeros; the diagonal stays 0.
    array = [[0]*len(unique) for _ in range(len(unique))]
    for row in data:
        # Look up the matrix position of each label in the pair.
        i1 = unique.index(row[0])
        i2 = unique.index(row[1])
        # Store the value in both directions so the matrix is symmetric.
        array[i1][i2] = row[2]
        array[i2][i1] = row[2]
    return array

unique1 = 'abcd'
in1 = [('a', 'b', 1), ('a', 'c', 2), ('a', 'd', 3), ('b', 'c', 4), ('b', 'd', 5), ('c', 'd', 6)]
print(makeSimMatrix(unique1,in1))

unique2 = 'abcde'
in2 = [('a', 'b', 1), ('a', 'c', 2), ('a', 'd', 3), ('a', 'e', 4), ('b', 'c', 5), ('b', 'd', 6), ('b', 'e', 7), ('c', 'd', 8), ('c', 'e', 9), ('d', 'e', 10)]
print(makeSimMatrix(unique2,in2))
Output:
[[0, 1, 2, 3], [1, 0, 4, 5], [2, 4, 0, 6], [3, 5, 6, 0]]
[[0, 1, 2, 3, 4], [1, 0, 5, 6, 7], [2, 5, 0, 8, 9], [3, 6, 8, 0, 10], [4, 7, 9, 10, 0]]
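Each call to `unique.index(...)` scans the labels from the start, which may become slow with many labels. One possible variation, as a rough sketch, builds a label-to-position dict once and reuses it. The name `make_sim_matrix_indexed` is made up for this illustration and is not part of the code above:

```python
def make_sim_matrix_indexed(unique, data):
    # Hypothetical variant: map each label to its position once, up front.
    index = {label: pos for pos, label in enumerate(unique)}
    n = len(unique)
    # Independent rows; [[0] * n] * n would alias a single row n times.
    matrix = [[0] * n for _ in range(n)]
    for first, second, value in data:
        i, j = index[first], index[second]
        # Fill both directions so the matrix is symmetric.
        matrix[i][j] = value
        matrix[j][i] = value
    return matrix

pairs = [('a', 'b', 1), ('a', 'c', 2), ('a', 'd', 3),
         ('b', 'c', 4), ('b', 'd', 5), ('c', 'd', 6)]
result = make_sim_matrix_indexed('abcd', pairs)
print(result)
# [[0, 1, 2, 3], [1, 0, 4, 5], [2, 4, 0, 6], [3, 5, 6, 0]]
```

For a handful of labels the difference is negligible, so the simpler version is fine there.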
If you don't already have the list of unique labels, you could derive it from the data in a separate pass.
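That extra pass could look roughly like the following. This is only a sketch: the helper name `labels_from_pairs` is invented here, and it assumes you want the labels in the order they first appear in the data (sort the result if you prefer a different order):

```python
def labels_from_pairs(data):
    # A dict keeps insertion order (Python 3.7+), so this collects each
    # label once, in the order it is first seen in the tuples.
    seen = {}
    for first, second, _ in data:
        seen.setdefault(first, None)
        seen.setdefault(second, None)
    return list(seen)

pairs = [('a', 'b', 1), ('a', 'c', 2), ('a', 'd', 3),
         ('b', 'c', 4), ('b', 'd', 5), ('c', 'd', 6)]
labels = labels_from_pairs(pairs)
print(labels)
# ['a', 'b', 'c', 'd']
```

The resulting list can then be passed as the first argument when building the matrix.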