I am new to Pandas. I am trying to understand how to do something in pandas, that I do in SQL -
I have a table like -
Account Company Blotter
112233 10 62
233445 12 62
233445 10 66
343454 21 66
343454 21 64
768876 25 54
In SQL, if a given account appears in multiple rows I would use rank() and if I want to give a preference to a certain company, I would put a case statement to force that company to be prioritized. I can also use Blotter column as an additional rank parameter. e.g.
rank() over(
partition by ACCOUNT
order by case
when COMPANY='12' then 0
when COMPANY='21' then 1
else COMPANY
end,
case
when BLOTTER ='66' then 0
else BLOTTER
end
)
expected output:
Account Company Blotter rank
0 112233 10 62 1
1 233445 12 62 1
2 233445 10 66 2
3 343454 21 66 1
4 343454 21 64 2
5 768876 25 54 1
You may want to try this:
# recompute the sort criteria for company and blotter
ser_sort_company= df['Company'].map({12: 0, 21: 1}).fillna(df['Company'])
ser_sort_blotter= df['Blotter'].map({12: 0, 21: 1}).fillna(df['Blotter'])
df['rank']= (df
# temporarily create sort columns
.assign(sort_company=ser_sort_company)
.assign(sort_blotter=ser_sort_blotter)
# temporarily sort the result
# this replaces the ORDER BY part
.sort_values(['sort_company', 'sort_blotter'])
# group by Account to replace the PARTITION BY part
.groupby('Account')
# get the position of the record in the group (RANK part)
.transform('cumcount') + 1
)
df
It evaluates to:
Account Company Blotter rank
0 112233 10 62 1
1 233445 12 62 1
2 233445 10 66 2
3 343454 21 66 2
4 343454 21 64 1
5 768876 25 54 1