I have spent hours troubleshooting this and really appreciate your help!
Each node can have multiple activities, and each activity can have up to two associated grades, which were an ARRAY type in SQL. My goal is to get the minimum activity grade for each node. I imported the ACTIVITY_GRADE field using ARRAY_TO_STRING(ACTIVITY_GRADE) AS ACTIVITY_GRADE (there might be a better way to import it to make it a list I can iterate over?)
The data looks like the below, and my goal is to get the last column as it is shown: min_node_grade.
STUDENT_ID RECORD_ID NODE_NAME ACTIVITY_NAME ACTIVITY_GRADE GOAL = GET THIS min_node_grade FredID gobbledeegook1 Node1 MyActivity1 PreK, Kindergarten PreK FredID gobbledeegook2 Node1 MyActivity1 Kindergarten PreK FredID gobbledeegook3 Node2 MyActivity2 1st Grade 1st Grade JaniceID gobbledeegook4 Node3 MyActivity3 Kindergarten Kindergarten JaniceID gobbledeegook5 Node3 MyActivity3 1st Grade Kindergarten
I have gone a likely unnecessarily complicated route. How can I create a function that will map the grade_to_index dictionary across each value of the comma-separated string that is ACTIVITY_GRADE, take the minimum for that activity, then group by node and take the min over the node's activities?
#split it into two columns
df[['activity_grade_a', 'activity_grade_b']] = df.ACTIVITY_GRADE.str.split(",", expand = True)
#map it to integers so can take min to identify grade
grade_to_index = {"Preschool": -2, "Pre-K": -1, "Kindergarten": 0, "1st Grade": 1, '2nd Grade':2,'3rd Grade':3,'4th Grade':4,'5th Grade':5}
#map to invert the dictionary in order to get it back to text form
inv_map = {v: k for k, v in grade_to_index.items()}
#create columns with the index for the one or two grades.
df['activity_grade_a_index']=df['activity_grade_a'].replace(grade_to_index)
df['activity_grade_b_index']=df['activity_grade_b'].replace(grade_to_index)
#get minimum of each row across the two columns; axis=1 says looks across columns
df['activity_min_grade_index'] = df[['activity_grade_a_index', 'activity_grade_b_index']].min(axis=1)
#group by node and get the minimum of activity-level minimums, map it to a new field
df['min_node_grade_index']=df.groupby('NODE_NAME')['activity_min_grade_index'].transform('min')
#get the grade back
df['min_node_grade']=df['min_node_grade_index'].replace(inv_map)`
If you had df
as below,
df = pd.DataFrame(
columns=['STUDENT_ID', 'RECORD_ID', 'NODE_NAME', 'ACTIVITY_NAME', 'ACTIVITY_GRADE'],
data=[ ['FredID', 'gobbledeegook1', 'Node1', 'MyActivity1', 'PreK, Kindergarten PreK'],
['FredID', 'gobbledeegook2', 'Node1', 'MyActivity1', 'Kindergarten PreK'],
['FredID', 'gobbledeegook3', 'Node2', 'MyActivity2', '1st Grade 1st Grade'],
['JaniceID', 'gobbledeegook4', 'Node3', 'MyActivity3', 'Kindergarten Kindergarten'],
['JaniceID', 'gobbledeegook5', 'Node3', 'MyActivity3', '1st Grade Preschool'] ]
)
(I altered the last row so that it doesn't look like the added column only has the minimum for that row in the final output.)
and a reference dictionary index_to_grade
## I changed "Pre-K" to "PreK" since that's how it seems to be in your dataset
index_to_grade = {-2: 'Preschool', -1: 'PreK', 0: 'Kindergarten', 1: '1st Grade', 2: '2nd Grade', 3: '3rd Grade', 4: '4th Grade', 5: '5th Grade'}
# grade_to_index = {g: i for i, g in index_to_grade.items()} ## not needed for my solution
How can I create a function that will map the grade_to_index dictionary across each value of the comma-separated string that is ACTIVITY_GRADE, take the minimum for that activity, then group by node and take the min over the node's activities?
You can write a function like
def get_min_gi(grades:str, refDict=index_to_grade, defaultVal=None):
inds = [i for i,g in refDict.items() if g in grades]
return min(inds) if inds else defaultVal
and .apply
it to df['ACTIVITY_GRADE']
df['min_grade'] = df['ACTIVITY_GRADE'].apply(get_min_gi) ## will be over-written
min_node_grade_index = df.groupby('NODE_NAME')['min_grade'].transform('min')
df['min_grade'] = min_node_grade_index.map(index_to_grade) ## over-write
About why your code might not have worked:
grade_to_index = {"Preschool": -2, "Pre-K": -1, ....
FredID gobbledeegook1 Node1 MyActivity1 PreK, Kindergarten PreK
grade_to_index
has key "Pre-K", but your sample dataset contains various instances of "PreK" (is it just a typo?)
And also,
df.ACTIVITY_GRADE.str.split(",", expand = True)
- (view sample output)
df[___].replace(grade_to_index)
- (view sample output)
The issue here is that not all you ACTIVITY_GRADE
strings are properly comma-separated, so the split
wouldn't work as intended; and even if they were comma separated, you should split by ', '
[at least for your sample data] because .replace
will not work if there's extra whitespace around the keywords; and even then, you'd have more than 2 columns(view sample output) as a result (row 1 has 3 grades), so the df[['activity_grade_a', 'activity_grade_b']] =
part would raise error.
(That's why I prefer using a function like get_min_gi
above since it should work for any number of grades.)
To get that, I would suggest something like
# df['row_min_grade'] = df['ACTIVITY_GRADE'].apply(get_min_gi) # temporary
df['row_min_grade'] = [min([grade_to_index.get(v,v) for v in i]) for i in df['ACTIVITY_GRADE'].str.split(",")] # temporary
df['min_node_grade'] = df.groupby('NODE_NAME')['row_min_grade'].transform('min').map(inv_map)
df['min_activity_grade'] = df.groupby('ACTIVITY_NAME')['row_min_grade'].transform('min').map(inv_map)
df.drop(['row_min_grade'], axis='columns') # drop temporary column