Search code examples
pythonpandaspanel-data

Pandas dataframe change values in a column based on conditions


I have a large Dataframe below:

The data used as the example here 'education_val.csv' can be found here https://github.com/ENLK/Py-Projects-/blob/master/education_val.csv

import pandas as pd 

edu = pd.read_csv('education_val.csv')
del edu['Unnamed: 0']
edu.head(10)

ID  Year    Education
22445   1991    higher education
29925   1991    No qualifications
76165   1991    No qualifications
223725  1991    Other
280165  1991    intermediate qualifications
333205  1991    No qualifications
387605  1991    higher education
541285  1991    No qualifications
541965  1991    No qualifications
599765  1991    No qualifications

The values in the column Education are:

edu.Education.value_counts()

intermediate qualifications 153705
higher education    67020
No qualifications   55842
Other   36915

I want to replace the values in the column Education in the following ways:

  1. If an ID has the value higher education in a year in the column Education then all future years for that ID will also have higher education in the Education column.

  2. If an ID has the value intermediate qualifications in a year, then all future years for that ID will have intermediate qualifications in the corresponding Education column. However, if the value higher education occurs in any of the subsequent years for this ID, then higher education replaces intermediate qualifications in the subsequent years, regardless if Other or No qualifications occur.

For example in the DataFrame below, ID 22445 has the valuehigher education in the year 1991, all subsequent values of Education for 22445 should be replaced with higher education in the later years, up to the year 2017.

edu.loc[edu['ID'] == 22445]

ID  Year    Education
22445   1991    higher education
22445   1992    higher education
22445   1993    higher education
22445   1994    higher education
22445   1995    higher education
22445   1996    intermediate qualifications
22445   1997    intermediate qualifications
22445   1998    Other
22445   1999    No qualifications
22445   2000    intermediate qualifications
22445   2001    intermediate qualifications
22445   2002    intermediate qualifications
22445   2003    intermediate qualifications
22445   2004    intermediate qualifications
22445   2005    intermediate qualifications
22445   2006    intermediate qualifications
22445   2007    intermediate qualifications
22445   2008    intermediate qualifications
22445   2010    intermediate qualifications
22445   2011    intermediate qualifications
22445   2012    intermediate qualifications
22445   2013    intermediate qualifications
22445   2014    intermediate qualifications
22445   2015    intermediate qualifications
22445   2016    intermediate qualifications
22445   2017    intermediate qualifications

Similarly, ID 1587125 in the Dataframe below has the value intermediate qualifications in the year 1991, and changes to higher education in 1993. All subsequent values in the column Education in the future years (from 1993 onwards) for 1587125 should be higher education.

edu.loc[edu['ID'] == 1587125]

ID  Year    Education
1587125 1991    intermediate qualifications
1587125 1992    intermediate qualifications
1587125 1993    higher education
1587125 1994    higher education
1587125 1995    higher education
1587125 1996    higher education
1587125 1997    higher education
1587125 1998    higher education
1587125 1999    higher education
1587125 2000    higher education
1587125 2001    higher education
1587125 2002    higher education
1587125 2003    higher education
1587125 2004    Other
1587125 2005    No qualifications
1587125 2006    intermediate qualifications
1587125 2007    intermediate qualifications
1587125 2008    intermediate qualifications
1587125 2010    intermediate qualifications
1587125 2011    higher education
1587125 2012    higher education
1587125 2013    higher education
1587125 2014    higher education
1587125 2015    higher education
1587125 2016    higher education
1587125 2017    higher education

There are 12,057 unique ID in the data and the column Year spans from 1991 to 2017. How does one change the values of Education for all 12, 057 according to the above conditions? I'm not sure how to do this in a uniform way for all unique IDs. The sample data used as the example here is attached in the Github link above. Many thanks in advance.


Solution

  • You can do it using the categorical data like this:

    df = pd.read_csv('https://raw.githubusercontent.com/ENLK/Py-Projects-/master/education_val.csv')
    
    eddtype = pd.CategoricalDtype(['No qualifications', 
                                   'Other',
                                   'intermediate qualifications',
                                   'higher education'], 
                                   ordered=True)
    df['EducationCat'] = df['Education'].str.strip().astype(eddtype)
    
    df['EduMax'] = df.sort_values('Year').groupby('ID')['EducationCat']\
                     .transform(lambda x: eddtype.categories[x.cat.codes.cummax()] )
    

    It is broken it up explicitly so you can see the data manipulations I am using.

    1. Create a Education categorical dtype with order
    2. Next, change dtype of Education column to use that categorical dtype (EducationCat)
    3. Use the codes of the categorical to preform cummax calculation
    4. With indexing to return the category defined by the cummax calculation (EduMax)

    Outputs:

    df[df['ID'] == 1587125]
    
                ID  Year                    Education                 EducationCat                       EduMax
    18      1587125  1991  intermediate qualifications  intermediate qualifications  intermediate qualifications
    12075   1587125  1992  intermediate qualifications  intermediate qualifications  intermediate qualifications
    24132   1587125  1993             higher education             higher education             higher education
    36189   1587125  1994             higher education             higher education             higher education
    48246   1587125  1995             higher education             higher education             higher education
    60303   1587125  1996             higher education             higher education             higher education
    72360   1587125  1997             higher education             higher education             higher education
    84417   1587125  1998             higher education             higher education             higher education
    96474   1587125  1999             higher education             higher education             higher education
    108531  1587125  2000             higher education             higher education             higher education
    120588  1587125  2001             higher education             higher education             higher education
    132645  1587125  2002             higher education             higher education             higher education
    144702  1587125  2003             higher education             higher education             higher education
    156759  1587125  2004                        Other                        Other             higher education
    168816  1587125  2005            No qualifications            No qualifications             higher education
    180873  1587125  2006  intermediate qualifications  intermediate qualifications             higher education
    192930  1587125  2007  intermediate qualifications  intermediate qualifications             higher education
    204987  1587125  2008  intermediate qualifications  intermediate qualifications             higher education
    217044  1587125  2010  intermediate qualifications  intermediate qualifications             higher education
    229101  1587125  2011             higher education             higher education             higher education
    241158  1587125  2012             higher education             higher education             higher education
    253215  1587125  2013             higher education             higher education             higher education
    265272  1587125  2014             higher education             higher education             higher education
    277329  1587125  2015             higher education             higher education             higher education
    289386  1587125  2016             higher education             higher education             higher education
    301443  1587125  2017             higher education             higher education             higher education