python, pandas, pyspark, string-matching, pandas-loc

Python DataFrame issue: match based on column value


I have a DataFrame which is quite big; a snippet is shown below.


SrNo | Merchant   | Revenue | Currency
1    | UBER SR    | 123     | INR
2    | UBER (SR)  | 123     | INR
3    | SR UBER    | 123     | INR
4    | ZOMATO SR  | 123     | INR
5    | ZOMATOSR   | 123     | INR
6    | 12FLIPKART | 123     | INR
7    | FLIPKART   | 123     | INR

My Output should look like:

SrNo | Merchant   | Revenue | Currency | Merchant_Flag
1    | UBER SR    | 123     | INR      | UBER
2    | UBER (SR)  | 123     | INR      | UBER
3    | SR UBER    | 123     | INR      | UBER
4    | ZOMATO SR  | 123     | INR      | ZOMATO
5    | ZOMATOSR   | 123     | INR      | ZOMATO
6    | 12FLIPKART | 123     | INR      | FLIPKART
7    | FLIPKART   | 123     | INR      | FLIPKART

Explanation: I want to add an additional column whose value depends on the Merchant column, i.e. if the Merchant value contains UBER, Merchant_Flag should be UBER, and likewise for ZOMATO and FLIPKART.

My dataset is huge. I tried using re.search and then .replace with if/else conditions, but that gives me performance issues. Another approach I tried was .loc:

df.loc[df['columnname'].str.contains('')]

Not sure how to proceed. Can someone help with this?
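For context, a stripped-down sketch of the kind of .loc / str.contains approach I mean (column and merchant names here are placeholders, not my real data):

    import pandas as pd

    df = pd.DataFrame({"Merchant": ["UBER SR", "UBER (SR)", "SR UBER", "ZOMATO SR",
                                    "ZOMATOSR", "12FLIPKART", "FLIPKART"]})

    # One pass per known merchant: rows whose Merchant contains the name
    # get that name written into Merchant_Flag (rows matching nothing stay NaN)
    for name in ["UBER", "ZOMATO", "FLIPKART"]:
        df.loc[df["Merchant"].str.contains(name, na=False), "Merchant_Flag"] = name

    print(df)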


Solution

  • I would use the pandas.DataFrame.applymap method. As a rule of thumb, built-in methods perform better for iterative operations over an object than a plain for/while loop written in Python, and this way you also avoid the overhead of calling the pandas.DataFrame.loc accessor multiple times.

    import re          # regex matching
    import pandas as pd

    df = pd.DataFrame({"Merchant": ["UBER SR", "SR UBER", "ZOMATO SR", "12FLIPKART"]})

    # Use applymap to apply the function element-wise over the whole DataFrame
    df["Merchant_Flag"] = df.applymap(lambda x: re.search(r"UBER|ZOMATO|FLIPKART", x).group(0))
    
    

    (Note: .group(0) returns the text matched by the whole pattern; re.search itself stops at the first occurrence in the string.)

    Output:

         Merchant Merchant_Flag
    0     UBER SR          UBER
    1     SR UBER          UBER
    2   ZOMATO SR        ZOMATO
    3  12FLIPKART      FLIPKART
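
    For reference, this is what re.search and .group(0) return on a single string (a quick standalone check, outside the DataFrame):

    import re

    m = re.search(r"UBER|ZOMATO|FLIPKART", "SR UBER")
    print(m.group(0))                                    # UBER
    print(re.search(r"UBER|ZOMATO|FLIPKART", "CABIFY"))  # None -> calling .group(0) here would raise AttributeError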
    

    EDIT

    In case not every Merchant value contains one of the names you are matching (so re.search can return None), you will need to modify the function passed to applymap a little:

    import re          # regex matching
    import pandas as pd

    # Return the matched merchant name, or None if no known merchant is found
    def match_merchant(elem, reg):
        m = re.search(reg, elem)
        if m is not None:
            # Return the text matched by the pattern
            return m.group(0)
        else:
            # Here you may return whatever suits your needs: None, an empty string...
            return None

    df = pd.DataFrame({"Merchant": ["UBER SR",
                                    "SR UBER",
                                    "ZOMATO SR",
                                    "12FLIPKART",
                                    "CABIFY"]})

    # Merchants you want to match
    regex = r"UBER|ZOMATO|FLIPKART"

    # Use applymap to iterate over the whole DataFrame
    df["Merchant_Flag"] = df.applymap(lambda x: match_merchant(x, regex))
    

    Output:

         Merchant Merchant_Flag
    0     UBER SR          UBER
    1     SR UBER          UBER
    2   ZOMATO SR        ZOMATO
    3  12FLIPKART      FLIPKART
    4      CABIFY          None
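
    As a side note, a different technique from the applymap approach above (sketched here only as an alternative, using pandas' vectorised string methods): Series.str.extract does the same matching in one call and returns NaN for rows that match nothing.

    import pandas as pd

    df = pd.DataFrame({"Merchant": ["UBER SR", "SR UBER", "ZOMATO SR",
                                    "12FLIPKART", "CABIFY"]})

    # str.extract needs a capture group; expand=False returns a Series (NaN where nothing matches)
    df["Merchant_Flag"] = df["Merchant"].str.extract(r"(UBER|ZOMATO|FLIPKART)", expand=False)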