I am new to Pandas, and looking to sort a column containing strings and generate a numerical value to uniquely identify the string. My data frame looks something like this:
df = pd.DataFrame({'key': range(8), 'year_week': ['2015_10', '2015_1', '2015_11', '2016_9', '2016_10','2016_3', '2016_9', '2016_10']})
First I like to sort the 'year_week'
column to arrange in ascending order (2015_1, 2016_9, '2016_9', 2016_10, 2016_11, 2016_3, 2016_10, 2016_10)
and then generate a numerical value for each unique 'year_week'
string.
You can first convert to_datetime
column year_week
, then sort it by sort_values
and last use factorize
:
df = pd.DataFrame({'key': range(8), 'year_week': ['2015_10', '2015_1', '2015_11', '2016_9', '2016_10','2016_3', '2016_9', '2016_10']})
#http://stackoverflow.com/a/17087427/2901002
df['date'] = pd.to_datetime(df.year_week + '-0', format='%Y_%W-%w')
#sort by column date
df.sort_values('date', inplace=True)
#create numerical values
df['num'] = pd.factorize(df.year_week)[0]
print (df)
key year_week date num
1 1 2015_1 2015-01-11 0
0 0 2015_10 2015-03-15 1
2 2 2015_11 2015-03-22 2
5 5 2016_3 2016-01-24 3
3 3 2016_9 2016-03-06 4
6 6 2016_9 2016-03-06 4
4 4 2016_10 2016-03-13 5
7 7 2016_10 2016-03-13 5