I am trying to solve the problem below:
Write a program to access the university graduate data from data.gov.sg. The link to the web page is as follows:
https://data.gov.sg/api/action/datastore_search?resource_id=eb8b932c-503c-41e7-b513-114cffbe2338
Using this data, compute the top 3 course types for males and for females in each year.
A sample of the expected output is as follows:
1993
Males: Engineering Sciences | Humanities & Social Sciences | Natural, Physical & Mathematical Sciences
Females: Humanities & Social Sciences | Business & Administration | Natural, Physical & Mathematical Sciences
I am not sure how to group the data and get the top 3 courses for each year. I have everything working up to the last step, except for the grouping.
My Code:
import requests
import pprint
pp = pprint.PrettyPrinter(indent=4)
res = requests.get("https://data.gov.sg/api/action/datastore_search?resource_id=eb8b932c-503c-41e7-b513-114cffbe2338&limit=100")
obj = res.json()
pp.pprint(obj)  # pprint() prints by itself and returns None, so it is not wrapped in print()
for record in obj["result"]["records"]:
    print(record["year"], ' | ', record["sex"], ' ', record["type_of_course"])
Create a pandas DataFrame from the "records" key of the "result" dict in the response JSON, sort it with sort_values on "no_of_graduates", then groupby "year" for each "sex" and keep the first 3 rows of each group.
Example code:
import pandas as pd
import requests
res = requests.get("https://data.gov.sg/api/action/datastore_search?resource_id=eb8b932c-503c-41e7-b513-114cffbe2338&limit=100").json()['result']
df = pd.DataFrame(res['records'])
def convert_to_int(s: str) -> int:
    try:
        return int(s)
    except ValueError:
        return 0
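# Non-numeric counts become 0 so the column can be sorted numerically
df['no_of_graduates'] = df['no_of_graduates'].apply(convert_to_int)
# Split by sex; every row that is not 'Males' is treated as female
males = df[df['sex'] == 'Males']
females = df.drop(males.index)
# Sort by graduate count, keep the 3 largest rows per year, then order by year
males = males.sort_values('no_of_graduates', ascending=False).groupby('year').head(3).sort_values('year')
females = females.sort_values('no_of_graduates', ascending=False).groupby('year').head(3).sort_values('year')
result = pd.concat([males, females])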
Result DataFrame:
    _id      sex  no_of_graduates                             type_of_course  year
13   14    Males             1496                       Engineering Sciences  1993
 2    3    Males              481               Humanities & Social Sciences  1993
 7    8    Males              404  Natural, Physical & Mathematical Sciences  1993
43   44    Males             1666                       Engineering Sciences  1994
32   33    Males              512               Humanities & Social Sciences  1994
35   36    Males              413                  Business & Administration  1994
73   74    Males             1715                       Engineering Sciences  1995
62   63    Males              497               Humanities & Social Sciences  1995
67   68    Males              460  Natural, Physical & Mathematical Sciences  1995
92   93    Males              497               Humanities & Social Sciences  1996
97   98    Males              449  Natural, Physical & Mathematical Sciences  1996
95   96    Males              358                  Business & Administration  1996
17   18  Females             1173               Humanities & Social Sciences  1993
20   21  Females              708                  Business & Administration  1993
22   23  Females              588  Natural, Physical & Mathematical Sciences  1993
47   48  Females             1133               Humanities & Social Sciences  1994
50   51  Females              733                  Business & Administration  1994
52   53  Females              566  Natural, Physical & Mathematical Sciences  1994
77   78  Females             1240               Humanities & Social Sciences  1995
80   81  Females              788                  Business & Administration  1995
82   83  Females              572  Natural, Physical & Mathematical Sciences  1995
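You can then format the output as required from the result DataFrame. Note that limit=100 fetches only the first 100 records, so 1996 appears to be incomplete in the result above (there are no female rows for that year); raise the limit if you need every year.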
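For the formatting step, here is a minimal sketch using only the standard library. It assumes each record is a dict with the "year", "sex", "type_of_course" and "no_of_graduates" keys seen in the response. The function names (to_int, top_courses_by_year, format_report) are my own, and the two "Hypothetical Course" rows in the sample are made up purely to show the top-3 cut and the handling of non-numeric counts. The remaining sample rows are the 1993 rows from the result above.

```python
from collections import defaultdict


def to_int(value):
    """Convert a graduate count to int; non-numeric values such as "na" become 0."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return 0


def top_courses_by_year(records, n=3):
    """Return {year: {sex: [course, ...]}} holding the n largest courses per group."""
    groups = defaultdict(list)
    for rec in records:
        key = (rec["year"], rec["sex"])
        groups[key].append((to_int(rec["no_of_graduates"]), rec["type_of_course"]))

    top = defaultdict(dict)
    for (year, sex), rows in groups.items():
        # Largest graduate count first; ties keep their original order
        rows.sort(key=lambda row: row[0], reverse=True)
        top[year][sex] = [course for _, course in rows[:n]]
    return top


def format_report(top):
    """Render the mapping in the layout shown in the question."""
    lines = []
    for year in sorted(top):
        lines.append(str(year))
        for sex in ("Males", "Females"):
            if sex in top[year]:
                lines.append(f"{sex}: " + " | ".join(top[year][sex]))
    return "\n".join(lines)


# 1993 rows from the result above, deliberately shuffled, plus two made-up rows
sample = [
    {"year": "1993", "sex": "Males", "type_of_course": "Humanities & Social Sciences", "no_of_graduates": "481"},
    {"year": "1993", "sex": "Males", "type_of_course": "Hypothetical Course A", "no_of_graduates": "na"},
    {"year": "1993", "sex": "Males", "type_of_course": "Engineering Sciences", "no_of_graduates": "1496"},
    {"year": "1993", "sex": "Males", "type_of_course": "Natural, Physical & Mathematical Sciences", "no_of_graduates": "404"},
    {"year": "1993", "sex": "Females", "type_of_course": "Business & Administration", "no_of_graduates": "708"},
    {"year": "1993", "sex": "Females", "type_of_course": "Natural, Physical & Mathematical Sciences", "no_of_graduates": "588"},
    {"year": "1993", "sex": "Females", "type_of_course": "Hypothetical Course B", "no_of_graduates": "12"},
    {"year": "1993", "sex": "Females", "type_of_course": "Humanities & Social Sciences", "no_of_graduates": "1173"},
]

print(format_report(top_courses_by_year(sample)))
```

With the real data you could pass obj["result"]["records"] from the question directly, or result.to_dict("records") if you already have the DataFrame; both should give a list of dicts in this shape.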