python, pandas, pyspark, string-matching, pandas-loc

Python DataFrame issue: match based on column value


I have a DataFrame which is quite big; a snippet is shown below.


SrNo | Merchant   | Revenue | Currency
1    | UBER SR    | 123     | INR
2    | UBER (SR)  | 123     | INR
3    | SR UBER    | 123     | INR
4    | ZOMATO SR  | 123     | INR
5    | ZOMATOSR   | 123     | INR
6    | 12FLIPKART | 123     | INR
7    | FLIPKART   | 123     | INR

My Output should look like:

SrNo | Merchant   | Revenue | Currency | Merchant_Flag
1    | UBER SR    | 123     | INR      | UBER
2    | UBER (SR)  | 123     | INR      | UBER
3    | SR UBER    | 123     | INR      | UBER
4    | ZOMATO SR  | 123     | INR      | ZOMATO
5    | ZOMATOSR   | 123     | INR      | ZOMATO
6    | 12FLIPKART | 123     | INR      | FLIPKART
7    | FLIPKART   | 123     | INR      | FLIPKART

Explanation: I want to add an additional column whose value depends on the Merchant column, i.e. if the Merchant value contains UBER, Merchant_Flag should be UBER, and likewise for ZOMATO and FLIPKART.

My dataset is huge. I tried using re.search and then .replace with if/else conditions, but that gives me performance issues. Another approach I tried was .loc:

df.loc[df['columnname'].str.contains('')]

Not sure how to proceed. Can someone help with this?
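For context, a stripped-down sketch of the kind of .loc / str.contains approach I mean (column and merchant names here are placeholders, not my real data):

    import pandas as pd

    df = pd.DataFrame({"Merchant": ["UBER SR", "UBER (SR)", "SR UBER", "ZOMATO SR",
                                    "ZOMATOSR", "12FLIPKART", "FLIPKART"]})

    # One pass per known merchant: rows whose Merchant contains the name
    # get that name written into Merchant_Flag (rows matching nothing stay NaN)
    for name in ["UBER", "ZOMATO", "FLIPKART"]:
        df.loc[df["Merchant"].str.contains(name, na=False), "Merchant_Flag"] = name

    print(df)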


Solution

  • I would use the pandas.DataFrame.applymap method. As a rule of thumb, built-in methods perform better for iterative operations over an object than a plain for/while loop written in Python, and this way you also avoid the overhead of calling the pandas.DataFrame.loc accessor multiple times.

    import re          # regex matching
    import pandas as pd

    df = pd.DataFrame({"Merchant": ["UBER SR", "SR UBER", "ZOMATO SR", "12FLIPKART"]})

    # Use applymap to apply the function element-wise over the whole DataFrame
    df["Merchant_Flag"] = df.applymap(lambda x: re.search(r"UBER|ZOMATO|FLIPKART", x).group(0))
    
    

    (Note: .group(0) returns the text matched by the whole pattern; re.search itself stops at the first occurrence in the string.)

    Output:

         Merchant Merchant_Flag
    0     UBER SR          UBER
    1     SR UBER          UBER
    2   ZOMATO SR        ZOMATO
    3  12FLIPKART      FLIPKART
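
    For reference, this is what re.search and .group(0) return on a single string (a quick standalone check, outside the DataFrame):

    import re

    m = re.search(r"UBER|ZOMATO|FLIPKART", "SR UBER")
    print(m.group(0))                                    # UBER
    print(re.search(r"UBER|ZOMATO|FLIPKART", "CABIFY"))  # None -> calling .group(0) here would raise AttributeError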
    

    EDIT

    In case not every Merchant value contains one of the names you are matching (so re.search can return None), you will need to modify the function passed to applymap a little:

    import re          # regex matching
    import pandas as pd

    # Return the matched merchant name, or None if no known merchant is found
    def match_merchant(elem, reg):
        m = re.search(reg, elem)
        if m is not None:
            # Return the text matched by the pattern
            return m.group(0)
        else:
            # Here you may return whatever suits your needs: None, an empty string...
            return None

    df = pd.DataFrame({"Merchant": ["UBER SR",
                                    "SR UBER",
                                    "ZOMATO SR",
                                    "12FLIPKART",
                                    "CABIFY"]})

    # Merchants you want to match
    regex = r"UBER|ZOMATO|FLIPKART"

    # Use applymap to iterate over the whole DataFrame
    df["Merchant_Flag"] = df.applymap(lambda x: match_merchant(x, regex))
    

    Output:

         Merchant Merchant_Flag
    0     UBER SR          UBER
    1     SR UBER          UBER
    2   ZOMATO SR        ZOMATO
    3  12FLIPKART      FLIPKART
    4      CABIFY          None
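
    As a side note, a different technique from the applymap approach above (sketched here only as an alternative, using pandas' vectorised string methods): Series.str.extract does the same matching in one call and returns NaN for rows that match nothing.

    import pandas as pd

    df = pd.DataFrame({"Merchant": ["UBER SR", "SR UBER", "ZOMATO SR",
                                    "12FLIPKART", "CABIFY"]})

    # str.extract needs a capture group; expand=False returns a Series (NaN where nothing matches)
    df["Merchant_Flag"] = df["Merchant"].str.extract(r"(UBER|ZOMATO|FLIPKART)", expand=False)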