Search code examples
pythonpandasnumpydata-analysis

Obtaining number of occurrences of a variable from a column in python using pandas


I am in the learning phase of analyzing data using python, stumbled upon a doubt. Consider the following data set:

print (df)
         CITY                 OCCUPATION
0   BANGALORE        MECHANICAL ENGINEER
1   BANGALORE  COMPUTER SCIENCE ENGINEER
2   BANGALORE        MECHANICAL ENGINEER
3   BANGALORE  COMPUTER SCIENCE ENGINEER
4   BANGALORE  COMPUTER SCIENCE ENGINEER
5      MUMBAI                      ACTOR
6      MUMBAI                      ACTOR
7      MUMBAI               SHARE BROKER
8      MUMBAI               SHARE BROKER
9      MUMBAI                      ACTOR
10    CHENNAI                    RETIRED
11    CHENNAI             LAND DEVELOPER
12    CHENNAI        MECHANICAL ENGINEER
13    CHENNAI        MECHANICAL ENGINEER
14    CHENNAI        MECHANICAL ENGINEER
15      DELHI                  PHYSICIAN
16      DELHI                  PHYSICIAN
17      DELHI                 JOURNALIST
18      DELHI                 JOURNALIST
19      DELHI                      ACTOR
20       PUNE                    MANAGER
21       PUNE                    MANAGER
22       PUNE                    MANAGER

how to get the maximum number of jobs from a particular state using pandas. eg:

STATE OCCUPATION
----------------

BANGALORE - COMPUTER SCIENCE ENGINEER
-----------------------------------

MUMBAI - ACTOR
------------

Solution

  • First solution is groupby with Counter and most_common:

    For DELHI is same number 2 for JOURNALIST and PHYSICIAN, so difference in output of solutions.

    from collections import Counter
    
    df1 = df.groupby('CITY').OCCUPATION
            .apply(lambda x: Counter(x).most_common(1)[0][0])
            .reset_index()
    print (df1)
            CITY                 OCCUPATION
    0  BANGALORE  COMPUTER SCIENCE ENGINEER
    1    CHENNAI        MECHANICAL ENGINEER
    2      DELHI                  PHYSICIAN
    3     MUMBAI                      ACTOR
    4       PUNE                    MANAGER
    

    Another solution with groupby, size and nlargest:

    df1 = df.groupby(['CITY', 'OCCUPATION'])
            .size()
            .groupby(level=0)
            .nlargest(1)
            .reset_index(level=0,drop=True)
            .reset_index(name='a')
            .drop('a', axis=1)
    print (df1)
            CITY                 OCCUPATION
    0  BANGALORE  COMPUTER SCIENCE ENGINEER
    1    CHENNAI        MECHANICAL ENGINEER
    2      DELHI                 JOURNALIST
    3     MUMBAI                      ACTOR
    4       PUNE                    MANAGER
    

    EDIT:

    For debugging here is the best custom function what is same as lambda function:

    from collections import Counter
    
    def f(x):
        #print Series  
        print (x)
        #count values by Counter
        print (Counter(x).most_common())
        #get first top value - list ogf tuple
        print (Counter(x).most_common(1))
        #select list by indexing [0] - output is tuple
        print (Counter(x).most_common(1)[0])
        #select first value of tuple by another [0]
        #for selecting count use [1] instead [0]
        print (Counter(x).most_common(1)[0][0])
        return Counter(x).most_common(1)[0][0]
    
    df1 = df.groupby('CITY').OCCUPATION.apply(f).reset_index()
    print (df1)
            CITY                 OCCUPATION
    0  BANGALORE  COMPUTER SCIENCE ENGINEER
    1    CHENNAI        MECHANICAL ENGINEER
    2      DELHI                 JOURNALIST
    3     MUMBAI                      ACTOR
    4       PUNE                    MANAGER