Search code examples
pythonpython-3.xpandasstataquantitative-finance

Function in Python similar to egenmore's xtile() in Stata


I am a Stata user and trying to replicate some code in Python.

More specifically, I want to create a new column which is called port.

In Stata my code to achieve the desired output is:

egen port = xtile(marketcap), nquantiles(10) by(date) 

The xtile() function is part of the community-contributed package egenmore.

The above egen command is generating a variable called port, which returns a number from 1 to 10 for each observation that is dependent on the marketcap variable for each date.

The xtile function defines the variable we are utilising (marketcap), while the nquantiles option splits the data into terciles. The tickers whose marketcap is within the top tercile is coded as a 1, second highest tercile as a 2 and so on, until the tickers whose marketcap is in the lowest tercile is coded as a 10. The result is a number in the port variable for each ticker on each date. This occurs as companies marketcap change between months and therefore will be situated in different terciles in different months.

I use monthly time-series data (10+ years), but have included data for 10 tickers over 2 months:

df.head()

date     ticker   return      marketcap
2004m1     A      0.02500   2.500117e+10
2004m1     B      0.04000   3.600111e+10
2004m1     C     -0.01500   4.900222e+11
2004m1     D     -0.02500   1.400134e+10
2004m1     E     -0.04000   3.200288e+10
2004m1     F     -0.01200   9.300110e+10
2004m1     G      0.03500   5.500512e+12
2004m1     H      0.04600   2.100177e+11
2004m1     I      0.00500   3.300155e+10
2004m1     J      0.00750   2.400999e+13
2004m1     K      0.02500   7.700155e+10
2004m2     A      0.03500   2.600118e+10
2004m2     B      0.02000   3.300333e+10
2004m2     C     -0.00500   4.500477e+11
2004m2     D      0.01100   1.800155e+10
2004m2     E     -0.02000   3.500288e+10
2004m2     F      0.01200   3.001109e+11
2004m2     G      0.04200   5.700442e+12
2004m2     H      0.01100   2.800888e+11
2004m2     I      0.01900   3.600115e+10
2004m2     J      0.00550   2.100333e+13
2004m2     K     -0.01500   5.001550e+10

Solution

  • You need to use the pandas.qcut() function:

    In [1]: df['port'] = df.groupby(['date'])['marketcap'].transform(
                            lambda x: pd.qcut(x, 10, labels = False))
    
    In [2]: df
    Out[2]: 
          date ticker  return     marketcap      port
    0   2004m1      A  0.0250  2.500117e+10       0.0
    1   2004m1      B  0.0400  3.600111e+10       3.0
    2   2004m1      C -0.0150  4.900222e+11       7.0
    3   2004m1      D -0.0250  1.400134e+10       0.0
    4   2004m1      E -0.0400  3.200288e+10       1.0
    5   2004m1      F -0.0120  9.300110e+10       5.0
    6   2004m1      G  0.0350  5.500512e+12       8.0
    7   2004m1      H  0.0460  2.100177e+11       6.0
    8   2004m1      I  0.0050  3.300155e+10       2.0
    9   2004m1      J  0.0075  2.400999e+13       9.0
    10  2004m1      K  0.0250  7.700155e+10       4.0
    11  2004m2      A  0.0350  2.600118e+10       0.0
    12  2004m2      B  0.0200  3.300333e+10       1.0
    13  2004m2      C -0.0050  4.500477e+11       7.0
    14  2004m2      D  0.0110  1.800155e+10       0.0
    15  2004m2      E -0.0200  3.500288e+10       2.0
    16  2004m2      F  0.0120  3.001109e+11       6.0
    17  2004m2      G  0.0420  5.700442e+12       8.0
    18  2004m2      H  0.0110  2.800888e+11       5.0
    19  2004m2      I  0.0190  3.600115e+10       3.0
    20  2004m2      J  0.0055  2.100333e+13       9.0
    21  2004m2      K -0.0150  5.001550e+10       4.0
    

    Using your toy example in Stata:

    clear
    
    input str7 date str1 ticker return marketcap
    "01/2004"     "A"      0.02500   2.500117e+10
    "01/2004"     "B"      0.04000   3.600111e+10
    "01/2004"     "C"     -0.01500   4.900222e+11
    "01/2004"     "D"     -0.02500   1.400134e+10
    "01/2004"     "E"     -0.04000   3.200288e+10
    "01/2004"     "F"     -0.01200   9.300110e+10
    "01/2004"     "G"      0.03500   5.500512e+12
    "01/2004"     "H"      0.04600   2.100177e+11
    "01/2004"     "I"      0.00500   3.300155e+10
    "01/2004"     "J"      0.00750   2.400999e+13
    "01/2004"     "K"      0.02500   7.700155e+10
    "02/2004"     "A"      0.03500   2.600118e+10
    "02/2004"     "B"      0.02000   3.300333e+10
    "02/2004"     "C"     -0.00500   4.500477e+11
    "02/2004"     "D"      0.01100   1.800155e+10
    "02/2004"     "E"     -0.02000   3.500288e+10
    "02/2004"     "F"      0.01200   3.001109e+11
    "02/2004"     "G"      0.04200   5.700442e+12
    "02/2004"     "H"      0.01100   2.800888e+11
    "02/2004"     "I"      0.01900   3.600115e+10
    "02/2004"     "J"      0.00550   2.100333e+13
    "02/2004"     "K"     -0.01500   5.001550e+10
    end
    

    The command egen produces equivalent results:

    egen port = xtile(marketcap), nquantiles(10) by(date)
    
    list, sepby(date)
    
         +---------------------------------------------+
         |    date   ticker   return   market~p   port |
         |---------------------------------------------|
      1. | 01/2004        A     .025   2.50e+10      1 |
      2. | 01/2004        B      .04   3.60e+10      4 |
      3. | 01/2004        C    -.015   4.90e+11      8 |
      4. | 01/2004        D    -.025   1.40e+10      1 |
      5. | 01/2004        E     -.04   3.20e+10      2 |
      6. | 01/2004        F    -.012   9.30e+10      6 |
      7. | 01/2004        G     .035   5.50e+12      9 |
      8. | 01/2004        H     .046   2.10e+11      7 |
      9. | 01/2004        I     .005   3.30e+10      3 |
     10. | 01/2004        J    .0075   2.40e+13     10 |
     11. | 01/2004        K     .025   7.70e+10      5 |
         |---------------------------------------------|
     12. | 02/2004        A     .035   2.60e+10      1 |
     13. | 02/2004        B      .02   3.30e+10      2 |
     14. | 02/2004        C    -.005   4.50e+11      8 |
     15. | 02/2004        D     .011   1.80e+10      1 |
     16. | 02/2004        E     -.02   3.50e+10      3 |
     17. | 02/2004        F     .012   3.00e+11      7 |
     18. | 02/2004        G     .042   5.70e+12      9 |
     19. | 02/2004        H     .011   2.80e+11      6 |
     20. | 02/2004        I     .019   3.60e+10      4 |
     21. | 02/2004        J    .0055   2.10e+13     10 |
     22. | 02/2004        K    -.015   5.00e+10      5 |
         +---------------------------------------------+
    

    The differences in the results are attributed to the fact that indexing starts with one in Stata.