python regex pandas data-analysis recommendation-engine

How to get product pairs using pandas for each customer

I want to get product pairs where consecutive products viewed by the customer in which the second product is part of the recommendations of first product.

e.g. Customer 1 viewed Product P1 and the recommendations for the product are R1-R2-R3-R4 (4 different products) and time stamp t1.

Customer 1 viewed Product P2 and the recommendations for the product(p2) are R11-R12-R13-R14 and time stamp t2.

We need the C1, P1,P2 where P2 is one of the recommendation of P1 i.e. R1/R2/R3/R4

customer_id	product_id(viewed)	recommendation_items	time_stamp
1	123	111-098-066-555	10-05-2020
1	111	213-012-122	10-05-2020
2	213	321-98712-1212-3434-4545	10-05-2020
2	987	98711-4567	10-05-2020

As we can see from the above table for customer 1, he viewed product(123) and in his recommendations item 111 product was there and he also viewed that product(second row product id). But for customer 2 it is not happened. The result table looks like this:

customer_id	product_id_1(viewed)	product_id_2(viewed)
1	123	111

Can anybody provide the logic or python code. The following code suggested is matching the string even the portion of the value matches, but we need to match the entire value.

Solution

IIUC, you can use a regex to rextract the matching recommendations, then join:

g = df.groupby('customer_id')
regexes = g['product_id(viewed)'].agg(lambda x: '|'.join(x.astype(str)))
matches = (g['recommendation_items']
           .apply(lambda s: s.str.extractall(fr'\b({regexes[s.name]})\b'))
           .droplevel(['customer_id', 'match'])[0]
          )
# 0    111
# 3    987
# Name: 0, dtype: object

out = (df.join(matches.rename('product_id2(viewed)'), how='right')
         [['customer_id', 'product_id(viewed)', 'product_id2(viewed)']]
      )

output:

   customer_id  product_id(viewed) product_id2(viewed)
0            1                 123                 111
3            2                 987                 987

To remove self recommendation:

out = (df.join(matches.rename('product_id2(viewed)'), how='right')
         [['customer_id', 'product_id(viewed)', 'product_id2(viewed)']]
         .loc[lambda d: d['product_id(viewed)'].astype(str).ne(d['product_id2(viewed)'])]
      )

output:

   customer_id  product_id(viewed) product_id2(viewed)
0            1                 123                 111