Search code examples
pythonregexpandasdata-analysisrecommendation-engine

How to get product pairs using pandas for each customer


I want to get product pairs where consecutive products viewed by the customer in which the second product is part of the recommendations of first product.

e.g. Customer 1 viewed Product P1 and the recommendations for the product are R1-R2-R3-R4 (4 different products) and time stamp t1.

Customer 1 viewed Product P2 and the recommendations for the product(p2) are R11-R12-R13-R14 and time stamp t2.

We need the C1, P1,P2 where P2 is one of the recommendation of P1 i.e. R1/R2/R3/R4

customer_id product_id(viewed) recommendation_items time_stamp
1 123 111-098-066-555 10-05-2020
1 111 213-012-122 10-05-2020
2 213 321-98712-1212-3434-4545 10-05-2020
2 987 98711-4567 10-05-2020

As we can see from the above table for customer 1, he viewed product(123) and in his recommendations item 111 product was there and he also viewed that product(second row product id). But for customer 2 it is not happened. The result table looks like this:

customer_id product_id_1(viewed) product_id_2(viewed)
1 123 111

Can anybody provide the logic or python code. The following code suggested is matching the string even the portion of the value matches, but we need to match the entire value.


Solution

  • IIUC, you can use a regex to rextract the matching recommendations, then join:

    g = df.groupby('customer_id')
    regexes = g['product_id(viewed)'].agg(lambda x: '|'.join(x.astype(str)))
    matches = (g['recommendation_items']
               .apply(lambda s: s.str.extractall(fr'\b({regexes[s.name]})\b'))
               .droplevel(['customer_id', 'match'])[0]
              )
    # 0    111
    # 3    987
    # Name: 0, dtype: object
    
    out = (df.join(matches.rename('product_id2(viewed)'), how='right')
             [['customer_id', 'product_id(viewed)', 'product_id2(viewed)']]
          )
    

    output:

       customer_id  product_id(viewed) product_id2(viewed)
    0            1                 123                 111
    3            2                 987                 987
    

    To remove self recommendation:

    out = (df.join(matches.rename('product_id2(viewed)'), how='right')
             [['customer_id', 'product_id(viewed)', 'product_id2(viewed)']]
             .loc[lambda d: d['product_id(viewed)'].astype(str).ne(d['product_id2(viewed)'])]
          )
    

    output:

       customer_id  product_id(viewed) product_id2(viewed)
    0            1                 123                 111