Tags: python, list, python-2.7, iteration, python-itertools

Problems computing the score of a pairwise list in an iterative way?


Let's suppose that I have the following lists (actually they have a lot of sublists):

list_1 = [['Hi my name is anon'],
                 ['Hi I like #hokey']]


list_2 = [['Hi my name is anon_2'],
                 ['Hi I like #Basketball']]

I would like to compute the distance for every possible pair, taking one item from each list, with no repetitions (combinations without replacement, or a product?). For example:

distance between: ['Hi my name is anon'] and ['Hi my name is anon_2']
distance between: ['Hi my name is anon'] and ['Hi I like #Basketball']
distance between: ['Hi I like #hokey'] and ['Hi my name is anon_2']
distance between: ['Hi I like #hokey'] and ['Hi I like #Basketball']

And place the scores into a list like this:

[distance_1,distance_2,distance_3,distance_4]

For this I was thinking of using itertools product or combinations. This is what I tried:

strings_1 = [i[0] for i in list_1]
strings_2 = [i[0] for i in list_2]

import itertools

scores_list = [dis.jaccard(i,j) for i,j in zip(itertools.combinations(strings_1, strings_2))]

The problem is I am getting this traceback:

    scores_list = [dis.jaccard(i,j) for i,j in zip(itertools.combinations(strings_1, strings_2))]
TypeError: an integer is required

How can I do this task efficiently, and how can I compute this product-like or combination-like operation?


Solution

  • You need to use itertools.product to get the Cartesian product, like this

    from itertools import product
    [dis.jaccard(string1, string2) for string1, string2 in product(list_1, list_2)]
    

    The product will group the items, like this

    >>> from pprint import pprint
    >>> pprint(list(product(list_1, list_2)))
    [(['Hi my name is anon'], ['Hi my name is anon_2']),
     (['Hi my name is anon'], ['Hi I like #Basketball']),
     (['Hi I like #hokey'], ['Hi my name is anon_2']),
     (['Hi I like #hokey'], ['Hi I like #Basketball'])]
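
    For contrast, the TypeError in the question most likely comes from the signature of itertools.combinations, which takes one iterable and an integer length r, so a second list in the r position is rejected. A minimal sketch of the difference, reusing the names strings_1 and strings_2 from the question (the exact error message varies between Python versions):

```python
from itertools import combinations, product

strings_1 = ["Hi my name is anon", "Hi I like #hokey"]
strings_2 = ["Hi my name is anon_2", "Hi I like #Basketball"]

# combinations(iterable, r) expects an integer r as its second argument,
# so passing a second list raises TypeError as soon as it is called.
try:
    combinations(strings_1, strings_2)
    error = None
except TypeError as exc:
    error = exc

# product(a, b) accepts several iterables and yields every (a_item, b_item) pair.
pairs = list(product(strings_1, strings_2))

print(type(error).__name__)
print(len(pairs))
```

    The zip call in the question is also unnecessary here, since product already yields two-element tuples that can be unpacked directly.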
    

    If you want to apply the jaccard function only to the strings within the lists, then you might want to flatten the lists first, like this

    >>> list_11 = [item for items in list_1 for item in items]
    >>> list_21 = [item for items in list_2 for item in items]
    >>> pprint([str1 + " " + str2 for str1, str2 in product(list_11, list_21)])
    ['Hi my name is anon Hi my name is anon_2',
     'Hi my name is anon Hi I like #Basketball',
     'Hi I like #hokey Hi my name is anon_2',
     'Hi I like #hokey Hi I like #Basketball']
    >>> pprint([dis.jaccard(str1, str2) for str1, str2 in product(list_11, list_21)])
    ...
    ...
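
    The dis module is not shown in the question, so the output above is left elided. To make the flow runnable end to end, here is a sketch that swaps in a hypothetical jaccard_distance helper (an assumption, not the asker's dis.jaccard) that compares whitespace-separated tokens:

```python
from itertools import product

def jaccard_distance(text_a, text_b):
    """Hypothetical stand-in for dis.jaccard: 1 - |A & B| / |A | B| over whitespace tokens."""
    tokens_a, tokens_b = set(text_a.split()), set(text_b.split())
    union = tokens_a | tokens_b
    if not union:
        return 0.0  # two empty strings are treated as identical
    # float() keeps the division exact on Python 2.7 as well
    return 1.0 - float(len(tokens_a & tokens_b)) / len(union)

list_1 = [['Hi my name is anon'], ['Hi I like #hokey']]
list_2 = [['Hi my name is anon_2'], ['Hi I like #Basketball']]

# Flatten the one-element sublists into plain strings.
list_11 = [item for items in list_1 for item in items]
list_21 = [item for items in list_2 for item in items]

# One score per pair, in the same order that product yields the pairs.
scores = [jaccard_distance(s1, s2) for s1, s2 in product(list_11, list_21)]
print(scores)  # with this token-based stand-in: roughly [0.333, 0.875, 0.875, 0.4]
```

    A real distance function may tokenize differently (for example by characters or n-grams), so the actual scores will depend on the implementation behind dis.jaccard.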
    

    As suggested by Ashwini in the comments, for your case, you can directly use itertools.starmap, like this

    >>> from itertools import product, starmap
    >>> list(starmap(dis.jaccard, product(list_11, list_21)))
    

    For example,

    >>> list_1 = ["a1", "a2", "a3"]
    >>> list_2 = ["b1", "b2", "b3"]
    >>> from itertools import product, starmap
    >>> list(starmap(lambda x, y: x + " " + y, product(list_1, list_2)))
    ['a1 b1', 'a1 b2', 'a1 b3', 'a2 b1', 'a2 b2', 'a2 b3', 'a3 b1', 'a3 b2', 'a3 b3']
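
    On the efficiency part of the question: both product and starmap return iterators, so results can be consumed one at a time instead of being collected into a list up front. A small sketch, where join_pair is a hypothetical placeholder for the real scoring function:

```python
from itertools import islice, product, starmap

list_1 = ["a1", "a2", "a3"]
list_2 = ["b1", "b2", "b3"]

def join_pair(x, y):
    # Placeholder for the real scoring function (hypothetical name).
    return x + " " + y

# Nothing is computed yet: starmap wraps product lazily.
lazy_results = starmap(join_pair, product(list_1, list_2))

# Pull only the first three results; the rest stay uncomputed until requested.
first_three = list(islice(lazy_results, 3))

# The same iterator continues from where islice stopped.
remaining = list(lazy_results)

print(first_three)
print(len(remaining))
```

    The total number of pairs is still len(list_1) * len(list_2), so laziness saves memory rather than reducing the number of distance computations.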