
Group by pandas data frame unique first values - numpy array returned


Given a pandas data frame with two string columns, like this:

import pandas as pd

d = {'SCHOOL' : ['Yale', 'Yale', 'LBS', 'Harvard','UCLA', 'Harvard', 'HEC'],
     'NAME' : ['John', 'Marc', 'Alex', 'Will', 'Will','Miller', 'Tom']}

df = pd.DataFrame(d)

Notice that the relationship between NAME and SCHOOL is n to 1. When one person has attended two different schools (see the "Will" case), I want to get only the last school.

So far I got:

df = df.groupby('NAME')['SCHOOL'].unique().reset_index()

This returns:

     NAME           SCHOOL
0    Alex            [LBS]
1    John           [Yale]
2    Marc           [Yale]
3  Miller        [Harvard]
4     Tom            [HEC]
5    Will  [Harvard, UCLA]

PROBLEMS:

  • unique() returns both schools, not only the last school.
  • This line returns each SCHOOL value as a np.array instead of a string, which makes it very difficult to work further with this df.
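
Both problems can be checked directly. The snippet below is a minimal sketch that rebuilds the data frame from the question and inspects the "Will" row; the variable names grouped and will_schools are only for illustration. The exact container type printed may depend on the pandas version and the column dtype (with object-dtype strings it is typically a numpy.ndarray), but in any case it is an array-like, not a single string.

```python
import pandas as pd

d = {'SCHOOL': ['Yale', 'Yale', 'LBS', 'Harvard', 'UCLA', 'Harvard', 'HEC'],
     'NAME': ['John', 'Marc', 'Alex', 'Will', 'Will', 'Miller', 'Tom']}
df = pd.DataFrame(d)

# unique() collects every distinct SCHOOL per NAME, in order of appearance
grouped = df.groupby('NAME')['SCHOOL'].unique().reset_index()

# Pull out the cell for "Will" to see what kind of object it holds
will_schools = grouped.loc[grouped['NAME'] == 'Will', 'SCHOOL'].iloc[0]

print(type(will_schools))   # an array-like container, not str
print(list(will_schools))   # both schools are present
```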

Solution

  • Both problems were solved based on @IanS's comments.

    Using last() instead of unique():

    df = df.groupby('NAME')['SCHOOL'].last().reset_index()
    

    This returns:

         NAME   SCHOOL
    0    Alex      LBS
    1    John     Yale
    2    Marc     Yale
    3  Miller  Harvard
    4     Tom      HEC
    5    Will     UCLA
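
For completeness, here is a self-contained sketch of the last() approach on the question's data. Note that "last" means last in the current row order of df, so this assumes the rows are already ordered the way you want (for example chronologically); the question does not say how they are ordered. The drop_duplicates variant is an alternative not mentioned in the original answer, shown only as a cross-check; the names last_school and alt are illustrative.

```python
import pandas as pd

d = {'SCHOOL': ['Yale', 'Yale', 'LBS', 'Harvard', 'UCLA', 'Harvard', 'HEC'],
     'NAME': ['John', 'Marc', 'Alex', 'Will', 'Will', 'Miller', 'Tom']}
df = pd.DataFrame(d)

# last() keeps one value per NAME: the last SCHOOL in current row order
last_school = df.groupby('NAME')['SCHOOL'].last().reset_index()

# Alternative (an assumption, not from the original answer):
# keep the last row per NAME, then sort to match the groupby output
alt = (df.drop_duplicates(subset='NAME', keep='last')
         .sort_values('NAME')[['NAME', 'SCHOOL']]
         .reset_index(drop=True))

print(last_school)
print(alt)
```

Each SCHOOL cell is now a plain string, so ordinary string operations such as last_school['SCHOOL'].str.upper() work as usual.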