I am in the learning phase of analyzing data using python, stumbled upon a doubt. Consider the following data set:
print (df)
CITY OCCUPATION
0 BANGALORE MECHANICAL ENGINEER
1 BANGALORE COMPUTER SCIENCE ENGINEER
2 BANGALORE MECHANICAL ENGINEER
3 BANGALORE COMPUTER SCIENCE ENGINEER
4 BANGALORE COMPUTER SCIENCE ENGINEER
5 MUMBAI ACTOR
6 MUMBAI ACTOR
7 MUMBAI SHARE BROKER
8 MUMBAI SHARE BROKER
9 MUMBAI ACTOR
10 CHENNAI RETIRED
11 CHENNAI LAND DEVELOPER
12 CHENNAI MECHANICAL ENGINEER
13 CHENNAI MECHANICAL ENGINEER
14 CHENNAI MECHANICAL ENGINEER
15 DELHI PHYSICIAN
16 DELHI PHYSICIAN
17 DELHI JOURNALIST
18 DELHI JOURNALIST
19 DELHI ACTOR
20 PUNE MANAGER
21 PUNE MANAGER
22 PUNE MANAGER
how to get the maximum number of jobs from a particular state using pandas. eg:
STATE OCCUPATION
----------------
BANGALORE - COMPUTER SCIENCE ENGINEER
-----------------------------------
MUMBAI - ACTOR
------------
First solution is groupby
with Counter
and most_common
:
For DELHI
is same number 2
for JOURNALIST
and PHYSICIAN
, so difference in output of solutions.
from collections import Counter
df1 = df.groupby('CITY').OCCUPATION
.apply(lambda x: Counter(x).most_common(1)[0][0])
.reset_index()
print (df1)
CITY OCCUPATION
0 BANGALORE COMPUTER SCIENCE ENGINEER
1 CHENNAI MECHANICAL ENGINEER
2 DELHI PHYSICIAN
3 MUMBAI ACTOR
4 PUNE MANAGER
Another solution with groupby
, size
and nlargest
:
df1 = df.groupby(['CITY', 'OCCUPATION'])
.size()
.groupby(level=0)
.nlargest(1)
.reset_index(level=0,drop=True)
.reset_index(name='a')
.drop('a', axis=1)
print (df1)
CITY OCCUPATION
0 BANGALORE COMPUTER SCIENCE ENGINEER
1 CHENNAI MECHANICAL ENGINEER
2 DELHI JOURNALIST
3 MUMBAI ACTOR
4 PUNE MANAGER
EDIT:
For debugging here is the best custom function what is same as lambda function:
from collections import Counter
def f(x):
#print Series
print (x)
#count values by Counter
print (Counter(x).most_common())
#get first top value - list ogf tuple
print (Counter(x).most_common(1))
#select list by indexing [0] - output is tuple
print (Counter(x).most_common(1)[0])
#select first value of tuple by another [0]
#for selecting count use [1] instead [0]
print (Counter(x).most_common(1)[0][0])
return Counter(x).most_common(1)[0][0]
df1 = df.groupby('CITY').OCCUPATION.apply(f).reset_index()
print (df1)
CITY OCCUPATION
0 BANGALORE COMPUTER SCIENCE ENGINEER
1 CHENNAI MECHANICAL ENGINEER
2 DELHI JOURNALIST
3 MUMBAI ACTOR
4 PUNE MANAGER