Search code examples
pythondictionarygroup-bysplit

How to map a function over a dataframe field that is a list


I have spent hours troubleshooting this and really appreciate your help!

Each node can have multiple activities, and each activity can have up to two associated grades, which were an ARRAY type in SQL. My goal is to get the minimum activity grade for each node. I imported the ACTIVITY_GRADE field using ARRAY_TO_STRING(ACTIVITY_GRADE) AS ACTIVITY_GRADE (there might be a better way to import it to make it a list I can iterate over?)

The data looks like the below, and my goal is to get the last column as it is shown: min_node_grade.

STUDENT_ID RECORD_ID      NODE_NAME ACTIVITY_NAME  ACTIVITY_GRADE       GOAL = GET THIS min_node_grade
FredID     gobbledeegook1   Node1   MyActivity1    PreK, Kindergarten   PreK
FredID     gobbledeegook2   Node1   MyActivity1    Kindergarten         PreK
FredID     gobbledeegook3   Node2   MyActivity2    1st Grade            1st Grade
JaniceID   gobbledeegook4   Node3   MyActivity3    Kindergarten         Kindergarten
JaniceID   gobbledeegook5   Node3   MyActivity3    1st Grade            Kindergarten

I have gone a likely unnecessarily complicated route. How can I create a function that will map the grade_to_index dictionary across each value of the comma-separated string that is ACTIVITY_GRADE, take the minimum for that activity, then group by node and take the min over the node's activities?

#split it into two columns
df[['activity_grade_a', 'activity_grade_b']] = df.ACTIVITY_GRADE.str.split(",", expand = True)

#map it to integers so can take min to identify grade
grade_to_index = {"Preschool": -2, "Pre-K": -1, "Kindergarten": 0, "1st Grade": 1, '2nd Grade':2,'3rd Grade':3,'4th Grade':4,'5th Grade':5}

#map to invert the dictionary in order to get it back to text form
inv_map = {v: k for k, v in grade_to_index.items()}

#create columns with the index for the one or two grades. 
df['activity_grade_a_index']=df['activity_grade_a'].replace(grade_to_index)
df['activity_grade_b_index']=df['activity_grade_b'].replace(grade_to_index)

#get minimum of each row across the two columns; axis=1 says looks across columns
df['activity_min_grade_index'] = df[['activity_grade_a_index', 'activity_grade_b_index']].min(axis=1)

#group by node and get the minimum of activity-level minimums, map it to a new field
df['min_node_grade_index']=df.groupby('NODE_NAME')['activity_min_grade_index'].transform('min')

#get the grade back
df['min_node_grade']=df['min_node_grade_index'].replace(inv_map)`

Solution

  • If you had df as below,

    df = pd.DataFrame(
        columns=['STUDENT_ID', 'RECORD_ID', 'NODE_NAME', 'ACTIVITY_NAME', 'ACTIVITY_GRADE'],
        data=[  ['FredID', 'gobbledeegook1', 'Node1', 'MyActivity1', 'PreK, Kindergarten PreK'], 
                ['FredID', 'gobbledeegook2', 'Node1', 'MyActivity1', 'Kindergarten PreK'], 
                ['FredID', 'gobbledeegook3', 'Node2', 'MyActivity2', '1st Grade 1st Grade'], 
                ['JaniceID', 'gobbledeegook4', 'Node3', 'MyActivity3', 'Kindergarten Kindergarten'], 
                ['JaniceID', 'gobbledeegook5', 'Node3', 'MyActivity3', '1st Grade Preschool']    ]
    )  
    

    (I altered the last row so that it doesn't look like the added column only has the minimum for that row in the final output.)

    and a reference dictionary index_to_grade

    ## I changed "Pre-K" to "PreK" since that's how it seems to be in your dataset
    index_to_grade = {-2: 'Preschool', -1: 'PreK', 0: 'Kindergarten', 1: '1st Grade', 2: '2nd Grade', 3: '3rd Grade', 4: '4th Grade', 5: '5th Grade'}
    # grade_to_index = {g: i for i, g in index_to_grade.items()} ## not needed for my solution
    

    How can I create a function that will map the grade_to_index dictionary across each value of the comma-separated string that is ACTIVITY_GRADE, take the minimum for that activity, then group by node and take the min over the node's activities?

    You can write a function like

    def get_min_gi(grades:str, refDict=index_to_grade, defaultVal=None):
        inds = [i for i,g in refDict.items() if g in grades]
        return min(inds) if inds else defaultVal
    

    and .apply it to df['ACTIVITY_GRADE']

    df['min_grade'] = df['ACTIVITY_GRADE'].apply(get_min_gi) ## will be over-written
    min_node_grade_index = df.groupby('NODE_NAME')['min_grade'].transform('min')
    df['min_grade'] = min_node_grade_index.map(index_to_grade) ## over-write
    

    opdf



    About why your code might not have worked:

    grade_to_index = {"Preschool": -2, "Pre-K": -1, ....

    FredID gobbledeegook1 Node1 MyActivity1 PreK, Kindergarten PreK

    grade_to_index has key "Pre-K", but your sample dataset contains various instances of "PreK" (is it just a typo?)


    And also,

    df.ACTIVITY_GRADE.str.split(",", expand = True) - (view sample output)

    df[___].replace(grade_to_index) - (view sample output)

    The issue here is that not all you ACTIVITY_GRADE strings are properly comma-separated, so the split wouldn't work as intended; and even if they were comma separated, you should split by ', ' [at least for your sample data] because .replace will not work if there's extra whitespace around the keywords; and even then, you'd have more than 2 columns(view sample output) as a result (row 1 has 3 grades), so the df[['activity_grade_a', 'activity_grade_b']] = part would raise error.

    (That's why I prefer using a function like get_min_gi above since it should work for any number of grades.)


    ADDED EDIT

    desiredOp

    To get that, I would suggest something like

    # df['row_min_grade'] = df['ACTIVITY_GRADE'].apply(get_min_gi) # temporary 
    df['row_min_grade'] = [min([grade_to_index.get(v,v) for v in i]) for i in df['ACTIVITY_GRADE'].str.split(","‌​)] # temporary 
    df['min_node_grade'] = df.groupby('NODE_NAME')['row_min_grade'].transform('min').map(inv_map) 
    df['min_activity_grade'] = df.groupby('ACTIVITY_NAME')['row_min_grade'].transform('min').map(inv_map)
    df.drop(['row_min_grade'], axis='columns') # drop temporary column