Search code examples
pythongrouping

Facing issue in grouping the data


I am trying to solve below problem:

Write a program to access the university graduate data from data.gov.sg. The link to the web page is as follows:

https://data.gov.sg/api/action/datastore_search?resource_id=eb8b932c-503c-41e7-b513-114cffbe2338

Using this data, compute for each of the years, which are the top 3 course type for males and females.

A sample of the expected output is as follows:

1993
Males: Engineering Sciences | Humanities & Social Sciences | Natural, Physical & Mathematical Sciences
Females: Humanities & Social Sciences | Business & Administration | Natural, Physical & Mathematical Sciences

but not sure - how to group the data and get top 3 courses year wise. I have worked till the last step, except the grouping part

My Code:

import requests
import pprint

pp = pprint.PrettyPrinter(indent=4)

res = requests.get("https://data.gov.sg/api/action/datastore_search?resource_id=eb8b932c-503c-41e7-b513-114cffbe2338&limit=100")

obj = res.json()

print(pp.pprint(obj))

for record in obj["result"]["records"]:
    print(record["year"], ' | ', record["sex"] , ' ', record ["type_of_course"])

Solution

  • Create a Pandas DataFrame from the records key of your result dict from the response JSON and groupby "year" and "sex" after sort_values on "no_of_graduates"

    Example code:

    import pandas as pd
    import requests
    
    res = requests.get("https://data.gov.sg/api/action/datastore_search?resource_id=eb8b932c-503c-41e7-b513-114cffbe2338&limit=100").json()['result']
    
    df = pd.DataFrame(res['records'])
    
    def convert_to_int(s: str) -> int:
        try:
            return int(s)
        except ValueError:
            return 0
    
    df['no_of_graduates'] = df['no_of_graduates'].apply(convert_to_int)
    
    males = df[df['sex'] == 'Males']
    females = df.drop(males.index)
    
    males = males.sort_values('no_of_graduates', ascending=False).groupby('year').head(3).sort_values('year')
    females = females.sort_values('no_of_graduates', ascending=False).groupby('year').head(3).sort_values('year')
    
    result = pd.concat([males, females])
    

    Result DF

        _id sex no_of_graduates type_of_course  year
    13  14  Males   1496    Engineering Sciences    1993
    2   3   Males   481 Humanities & Social Sciences    1993
    7   8   Males   404 Natural, Physical & Mathematical Sciences   1993
    43  44  Males   1666    Engineering Sciences    1994
    32  33  Males   512 Humanities & Social Sciences    1994
    35  36  Males   413 Business & Administration   1994
    73  74  Males   1715    Engineering Sciences    1995
    62  63  Males   497 Humanities & Social Sciences    1995
    67  68  Males   460 Natural, Physical & Mathematical Sciences   1995
    92  93  Males   497 Humanities & Social Sciences    1996
    97  98  Males   449 Natural, Physical & Mathematical Sciences   1996
    95  96  Males   358 Business & Administration   1996
    17  18  Females 1173    Humanities & Social Sciences    1993
    20  21  Females 708 Business & Administration   1993
    22  23  Females 588 Natural, Physical & Mathematical Sciences   1993
    47  48  Females 1133    Humanities & Social Sciences    1994
    50  51  Females 733 Business & Administration   1994
    52  53  Females 566 Natural, Physical & Mathematical Sciences   1994
    77  78  Females 1240    Humanities & Social Sciences    1995
    80  81  Females 788 Business & Administration   1995
    82  83  Females 572 Natural, Physical & Mathematical Sciences   1995
    

    You can then format output as required from the result dataframe.