I am a Stata user and trying to replicate some code in Python.
More specifically, I want to create a new column which is called port
.
In Stata my code to achieve the desired output is:
egen port = xtile(marketcap), nquantiles(10) by(date)
The xtile()
function is part of the community-contributed package egenmore
.
The above egen
command is generating a variable called port
, which returns a number from 1
to 10
for each observation that is dependent on the marketcap
variable for each date.
The xtile
function defines the variable we are utilising (marketcap
), while the nquantiles
option splits the data into terciles. The tickers whose marketcap is within the top tercile is coded as a 1
, second highest tercile as a 2
and so on, until the tickers whose marketcap is in the lowest tercile is coded as a 10
.
The result is a number in the port
variable for each ticker on each date. This occurs as companies marketcap change between months and therefore will be situated in different terciles in different months.
I use monthly time-series data (10+ years), but have included data for 10 tickers over 2 months:
df.head()
date ticker return marketcap
2004m1 A 0.02500 2.500117e+10
2004m1 B 0.04000 3.600111e+10
2004m1 C -0.01500 4.900222e+11
2004m1 D -0.02500 1.400134e+10
2004m1 E -0.04000 3.200288e+10
2004m1 F -0.01200 9.300110e+10
2004m1 G 0.03500 5.500512e+12
2004m1 H 0.04600 2.100177e+11
2004m1 I 0.00500 3.300155e+10
2004m1 J 0.00750 2.400999e+13
2004m1 K 0.02500 7.700155e+10
2004m2 A 0.03500 2.600118e+10
2004m2 B 0.02000 3.300333e+10
2004m2 C -0.00500 4.500477e+11
2004m2 D 0.01100 1.800155e+10
2004m2 E -0.02000 3.500288e+10
2004m2 F 0.01200 3.001109e+11
2004m2 G 0.04200 5.700442e+12
2004m2 H 0.01100 2.800888e+11
2004m2 I 0.01900 3.600115e+10
2004m2 J 0.00550 2.100333e+13
2004m2 K -0.01500 5.001550e+10
You need to use the pandas.qcut()
function:
In [1]: df['port'] = df.groupby(['date'])['marketcap'].transform(
lambda x: pd.qcut(x, 10, labels = False))
In [2]: df
Out[2]:
date ticker return marketcap port
0 2004m1 A 0.0250 2.500117e+10 0.0
1 2004m1 B 0.0400 3.600111e+10 3.0
2 2004m1 C -0.0150 4.900222e+11 7.0
3 2004m1 D -0.0250 1.400134e+10 0.0
4 2004m1 E -0.0400 3.200288e+10 1.0
5 2004m1 F -0.0120 9.300110e+10 5.0
6 2004m1 G 0.0350 5.500512e+12 8.0
7 2004m1 H 0.0460 2.100177e+11 6.0
8 2004m1 I 0.0050 3.300155e+10 2.0
9 2004m1 J 0.0075 2.400999e+13 9.0
10 2004m1 K 0.0250 7.700155e+10 4.0
11 2004m2 A 0.0350 2.600118e+10 0.0
12 2004m2 B 0.0200 3.300333e+10 1.0
13 2004m2 C -0.0050 4.500477e+11 7.0
14 2004m2 D 0.0110 1.800155e+10 0.0
15 2004m2 E -0.0200 3.500288e+10 2.0
16 2004m2 F 0.0120 3.001109e+11 6.0
17 2004m2 G 0.0420 5.700442e+12 8.0
18 2004m2 H 0.0110 2.800888e+11 5.0
19 2004m2 I 0.0190 3.600115e+10 3.0
20 2004m2 J 0.0055 2.100333e+13 9.0
21 2004m2 K -0.0150 5.001550e+10 4.0
Using your toy example in Stata:
clear
input str7 date str1 ticker return marketcap
"01/2004" "A" 0.02500 2.500117e+10
"01/2004" "B" 0.04000 3.600111e+10
"01/2004" "C" -0.01500 4.900222e+11
"01/2004" "D" -0.02500 1.400134e+10
"01/2004" "E" -0.04000 3.200288e+10
"01/2004" "F" -0.01200 9.300110e+10
"01/2004" "G" 0.03500 5.500512e+12
"01/2004" "H" 0.04600 2.100177e+11
"01/2004" "I" 0.00500 3.300155e+10
"01/2004" "J" 0.00750 2.400999e+13
"01/2004" "K" 0.02500 7.700155e+10
"02/2004" "A" 0.03500 2.600118e+10
"02/2004" "B" 0.02000 3.300333e+10
"02/2004" "C" -0.00500 4.500477e+11
"02/2004" "D" 0.01100 1.800155e+10
"02/2004" "E" -0.02000 3.500288e+10
"02/2004" "F" 0.01200 3.001109e+11
"02/2004" "G" 0.04200 5.700442e+12
"02/2004" "H" 0.01100 2.800888e+11
"02/2004" "I" 0.01900 3.600115e+10
"02/2004" "J" 0.00550 2.100333e+13
"02/2004" "K" -0.01500 5.001550e+10
end
The command egen
produces equivalent results:
egen port = xtile(marketcap), nquantiles(10) by(date)
list, sepby(date)
+---------------------------------------------+
| date ticker return market~p port |
|---------------------------------------------|
1. | 01/2004 A .025 2.50e+10 1 |
2. | 01/2004 B .04 3.60e+10 4 |
3. | 01/2004 C -.015 4.90e+11 8 |
4. | 01/2004 D -.025 1.40e+10 1 |
5. | 01/2004 E -.04 3.20e+10 2 |
6. | 01/2004 F -.012 9.30e+10 6 |
7. | 01/2004 G .035 5.50e+12 9 |
8. | 01/2004 H .046 2.10e+11 7 |
9. | 01/2004 I .005 3.30e+10 3 |
10. | 01/2004 J .0075 2.40e+13 10 |
11. | 01/2004 K .025 7.70e+10 5 |
|---------------------------------------------|
12. | 02/2004 A .035 2.60e+10 1 |
13. | 02/2004 B .02 3.30e+10 2 |
14. | 02/2004 C -.005 4.50e+11 8 |
15. | 02/2004 D .011 1.80e+10 1 |
16. | 02/2004 E -.02 3.50e+10 3 |
17. | 02/2004 F .012 3.00e+11 7 |
18. | 02/2004 G .042 5.70e+12 9 |
19. | 02/2004 H .011 2.80e+11 6 |
20. | 02/2004 I .019 3.60e+10 4 |
21. | 02/2004 J .0055 2.10e+13 10 |
22. | 02/2004 K -.015 5.00e+10 5 |
+---------------------------------------------+
The differences in the results are attributed to the fact that indexing starts with one in Stata.