I want to get product pairs where consecutive products viewed by the customer in which the second product is part of the recommendations of first product.
e.g. Customer 1 viewed Product P1 and the recommendations for the product are R1-R2-R3-R4 (4 different products) and time stamp t1.
Customer 1 viewed Product P2 and the recommendations for the product(p2) are R11-R12-R13-R14 and time stamp t2.
We need the C1, P1,P2 where P2 is one of the recommendation of P1 i.e. R1/R2/R3/R4
customer_id | product_id(viewed) | recommendation_items | time_stamp |
---|---|---|---|
1 | 123 | 111-098-066-555 | 10-05-2020 |
1 | 111 | 213-012-122 | 10-05-2020 |
2 | 213 | 321-98712-1212-3434-4545 | 10-05-2020 |
2 | 987 | 98711-4567 | 10-05-2020 |
As we can see from the above table for customer 1, he viewed product(123) and in his recommendations item 111 product was there and he also viewed that product(second row product id). But for customer 2 it is not happened. The result table looks like this:
customer_id | product_id_1(viewed) | product_id_2(viewed) |
---|---|---|
1 | 123 | 111 |
Can anybody provide the logic or python code. The following code suggested is matching the string even the portion of the value matches, but we need to match the entire value.
IIUC, you can use a regex to rextract the matching recommendations, then join
:
g = df.groupby('customer_id')
regexes = g['product_id(viewed)'].agg(lambda x: '|'.join(x.astype(str)))
matches = (g['recommendation_items']
.apply(lambda s: s.str.extractall(fr'\b({regexes[s.name]})\b'))
.droplevel(['customer_id', 'match'])[0]
)
# 0 111
# 3 987
# Name: 0, dtype: object
out = (df.join(matches.rename('product_id2(viewed)'), how='right')
[['customer_id', 'product_id(viewed)', 'product_id2(viewed)']]
)
output:
customer_id product_id(viewed) product_id2(viewed)
0 1 123 111
3 2 987 987
To remove self recommendation:
out = (df.join(matches.rename('product_id2(viewed)'), how='right')
[['customer_id', 'product_id(viewed)', 'product_id2(viewed)']]
.loc[lambda d: d['product_id(viewed)'].astype(str).ne(d['product_id2(viewed)'])]
)
output:
customer_id product_id(viewed) product_id2(viewed)
0 1 123 111