Search code examples
pythonsortingpandasdataframecategorical-data

Sort a column containing string in Pandas


I am new to Pandas, and looking to sort a column containing strings and generate a numerical value to uniquely identify the string. My data frame looks something like this:

df = pd.DataFrame({'key': range(8), 'year_week': ['2015_10', '2015_1', '2015_11', '2016_9', '2016_10','2016_3', '2016_9', '2016_10']})

First I like to sort the 'year_week' column to arrange in ascending order (2015_1, 2016_9, '2016_9', 2016_10, 2016_11, 2016_3, 2016_10, 2016_10) and then generate a numerical value for each unique 'year_week' string.


Solution

  • You can first convert to_datetime column year_week, then sort it by sort_values and last use factorize:

    df = pd.DataFrame({'key': range(8), 'year_week': ['2015_10', '2015_1', '2015_11', '2016_9', '2016_10','2016_3', '2016_9', '2016_10']})
    
    #http://stackoverflow.com/a/17087427/2901002
    df['date'] = pd.to_datetime(df.year_week + '-0', format='%Y_%W-%w')
    #sort by column date
    df.sort_values('date', inplace=True)
    #create numerical values
    df['num'] = pd.factorize(df.year_week)[0]
    print (df)
       key year_week       date  num
    1    1    2015_1 2015-01-11    0
    0    0   2015_10 2015-03-15    1
    2    2   2015_11 2015-03-22    2
    5    5    2016_3 2016-01-24    3
    3    3    2016_9 2016-03-06    4
    6    6    2016_9 2016-03-06    4
    4    4   2016_10 2016-03-13    5
    7    7   2016_10 2016-03-13    5