I have a DF which is quite big. A snippet like the one shown below.
SrNo | Merchant | Revenue | Currency
1 | UBER SR | 123 | INR
2 | UBER (SR)| 123 | INR
3 | SR UBER | 123 | INR
4 | ZOMATO SR| 123 | INR
5 | ZOMATOSR | 123 | INR
6 |12FLIPAKRT| 123 | INR
7 | FLIPKART | 123 | INR
My Output should look like:
SrNo | Merchant | Revenue | Currency |Merchant_Flag
1 | UBER SR | 123 | INR | UBER
2 | UBER (SR)| 123 | INR | UBER
3 | SR UBER | 123 | INR | UBER
4 | ZOMATO SR| 123 | INR | ZOMATO
5 | ZOMATOSR | 123 | INR | ZOMATO
6 |12FLIPAKRT| 123 | INR | FLIPKART
7 | FLIPKART | 123 | INR | FLIPKART
Explanation : I want to add an additional column which should have values wrt to Merchant column i.e. if the Merchant column value has UBER in it, Merchant_Flag should be UBER and likewise for other ZOMATO, FLIPKART.
My Dataset is huge. I tried using re.search and then .replace using if and else for my conditions, it is giving me performance issue. Another solution, I tried was using .loc
df.loc[df['columnname'].str.contains('')]
. Not sure how to proceed. Can someone help on this.
I would use the pandas.DataFrame.applymap
method, it is always (as a rule of thumb) better in terms of performance to use built-in methods when performing iterative operations over an object instead of coding a plain for/while loop in python. This way you get rid of the overhead of calling the pandas.DataFrame.loc
accessor multiple times.
import re # import regex matching
df = pd.DataFrame({"Merchant":["UBER SR", "SR UBER", "ZOMATO SR", "12FLIPKART"]})
# Use applymap to quickly iterate over the whole dataframe
df["Merchant_Flag"] = df.applymap(lambda x: re.search(r"UBER|ZOMATO|FLIPKART", x).group(0))
(Note: .group(0)
is used to select the first match of re.search
)
Output:
Merchant Merchant_Flag
0 UBER SR UBER
1 SR UBER UBER
2 ZOMATO SR ZOMATO
3 12FLIPKART FLIPKART
In case you don't know all the merchants' names, you will need to modify a little bit the function passed to applymap
:
import re # import regex matching
# Function definition
def match_merchant(elem, reg):
# Match
m = re.search(reg, elem)
if m != None:
# Return first match if we got a match
return m.group(0)
else:
# Here you may specify exactly what it is you want to return: None, empty String...
return None
df = pd.DataFrame({"Merchant":["UBER SR",
"SR UBER",
"ZOMATO SR",
"12FLIPKART",
"CABIFY"]})
# Merchants you want to match
regex = r"UBER|ZOMATO|FLIPKART"
# Use applymap to quickly iterate over the whole dataframe
df["Merchant_Flag"] = df.applymap(lambda x: match_merchant(x, regex))
Output:
Merchant Merchant_Flag
0 UBER SR UBER
1 SR UBER UBER
2 ZOMATO SR ZOMATO
3 12FLIPKART FLIPKART
4 CABIFY None