Let's suppose that I have the following lists (actually they have a lot of sublists):
list_1 = [['Hi my name is anon'],
['Hi I like #hokey']]
list_2 = [['Hi my name is anon_2'],
['Hi I like #Basketball']]
I would like to compute the distance for every possible pair with no repetitions, taking one item from each list (combinations without replacement? product?). For example:
distance between: ['Hi my name is anon'] and ['Hi my name is anon_2']
distance between: ['Hi my name is anon'] and ['Hi I like #Basketball']
distance between: ['Hi I like #hokey'] and ['Hi my name is anon_2']
distance between: ['Hi I like #hokey'] and ['Hi I like #Basketball']
And place the scores into a list like this:
[distance_1,distance_2,distance_3,distance_4]
For this I was thinking of using itertools product or combinations. This is what I tried:
strings_1 = [i[0] for i in list_1]
strings_2 = [i[0] for i in list_2]
import itertools
scores_list = [dis.jaccard(i,j) for i,j in zip(itertools.combinations(strings_1, strings_2))]
The problem is I am getting this traceback:
scores_list = [dis.jaccard(i,j) for i,j in zip(itertools.combinations(strings_1, strings_2))]
TypeError: an integer is required
How can I do this task efficiently, and how can I compute this product/combination-like operation?
You need to use itertools.product to get the Cartesian product, like this
from itertools import product
[dis.jaccard(string1, string2) for string1, string2 in product(list_1, list_2)]
The product will group the items, like this
>>> from pprint import pprint
>>> pprint(list(product(list_1, list_2)))
[(['Hi my name is anon'], ['Hi my name is anon_2']),
(['Hi my name is anon'], ['Hi I like #Basketball']),
(['Hi I like #hokey'], ['Hi my name is anon_2']),
(['Hi I like #hokey'], ['Hi I like #Basketball'])]
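As for the TypeError in the question: the second argument of itertools.combinations is the integer length r of each combination, not a second iterable, so passing a list there fails. A minimal sketch of the difference, reusing the strings from the question (the names raised, within_pairs and cross_pairs are only illustrative):

```python
from itertools import combinations, product

strings_1 = ['Hi my name is anon', 'Hi I like #hokey']
strings_2 = ['Hi my name is anon_2', 'Hi I like #Basketball']

# combinations(iterable, r) expects an integer r as its second argument,
# so passing a second list raises TypeError.
try:
    combinations(strings_1, strings_2)
    raised = False
except TypeError:
    raised = True

# Unordered pairs drawn from ONE list: 2 items give 1 pair.
within_pairs = list(combinations(strings_1, 2))

# One item taken from EACH list: 2 x 2 gives 4 pairs.
cross_pairs = list(product(strings_1, strings_2))
```

So combinations is the tool for pairing items inside a single list, while product pairs items across two lists, which is what the expected output in the question describes.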
If you want to apply the jaccard function only to the strings within the lists, then you might want to preprocess the lists, like this
>>> list_11 = [item for items in list_1 for item in items]
>>> list_21 = [item for items in list_2 for item in items]
>>> pprint([str1 + " " + str2 for str1, str2 in product(list_11, list_21)])
['Hi my name is anon Hi my name is anon_2',
'Hi my name is anon Hi I like #Basketball',
'Hi I like #hokey Hi my name is anon_2',
'Hi I like #hokey Hi I like #Basketball']
>>> pprint([dis.jaccard(str1, str2) for str1, str2 in product(list_11, list_21)])
...
...
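Since dis is not shown in the question, here is a rough self-contained sketch of the whole pipeline. Note that jaccard_distance below is a hypothetical stand-in, not the dis.jaccard from the question: it assumes a word-level Jaccard distance on whitespace-split tokens, which may differ from what your library computes.

```python
from itertools import product

list_1 = [['Hi my name is anon'], ['Hi I like #hokey']]
list_2 = [['Hi my name is anon_2'], ['Hi I like #Basketball']]


def jaccard_distance(a, b):
    """Hypothetical stand-in for dis.jaccard: 1 - |A & B| / |A | B| on word sets."""
    set_a, set_b = set(a.split()), set(b.split())
    union = set_a | set_b
    if not union:  # both strings empty: treat them as identical
        return 0.0
    return 1.0 - len(set_a & set_b) / float(len(union))


# Flatten the one-element sublists into plain lists of strings.
list_11 = [item for items in list_1 for item in items]
list_21 = [item for items in list_2 for item in items]

# One score per (string from list_1, string from list_2) pair, in product order.
scores_list = [jaccard_distance(s1, s2) for s1, s2 in product(list_11, list_21)]
```

The scores come out in the same order as the pairs listed in the question, so scores_list[0] belongs to the first string of each list.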
As suggested by Ashwini in the comments, for your case, you can directly use itertools.starmap, like this
>>> from itertools import product, starmap
>>> list(starmap(dis.jaccard, product(list_11, list_21)))
For example,
>>> list_1 = ["a1", "a2", "a3"]
>>> list_2 = ["b1", "b2", "b3"]
>>> from itertools import product, starmap
>>> list(starmap(lambda x, y: x + " " + y, product(list_1, list_2)))
['a1 b1', 'a1 b2', 'a1 b3', 'a2 b1', 'a2 b2', 'a2 b3', 'a3 b1', 'a3 b2', 'a3 b3']
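On the efficiency side, both product and starmap return iterators, so the pairs and results are produced one at a time and you only build the full list if you call list() on them. A small sketch of consuming the results lazily (join_pair is just an illustrative helper standing in for the distance function):

```python
from itertools import islice, product, starmap

list_1 = ["a1", "a2", "a3"]
list_2 = ["b1", "b2", "b3"]


def join_pair(x, y):
    """Illustrative stand-in for a distance function taking two strings."""
    return x + " " + y


# Nothing is computed yet: starmap yields one result per pair on demand.
lazy_scores = starmap(join_pair, product(list_1, list_2))

# Pull only the first two results, then drain whatever is left.
first_two = list(islice(lazy_scores, 2))
remaining = list(lazy_scores)
```

With many sublists this lets you stream the scores into a file or an aggregate instead of holding every result in memory at once.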