Search code examples
pythonpandasdataframefinance

How to eliminate nested loops in a dataframe


I have a df with many columns that represent the market cap of companies that compose an index. The index of the df is dates.

Every 63 days/rows I would like to eliminate,for the next 63 days/rows all the values except the 500 greatest ones.

In other words: for the first 63 days/rows the only values that should be displayed are the ones from companies whose market cap was in the largest 500 in the first row.

for example:

[in]: pd.DataFrame(np.array([[1, 1, 0.5], [5 ,2, 10], [1,3, 10],[4,2, 10]]), 
                       columns=['a', 'b','c'])

[out]:     a   b   c    
      0   1.0 1.0 0.5
      1   5.0 2.0 10.0    
      2   1.0 3.0 10.0    
      3   4.0 2.0 10.0

Suppose in this example I would like to use 2 days/rows. the desired output would be:

    a   b   c
 0 1.0 1.0 NaN
 1 5.0 2.0 NaN
 2 NaN 3.0 10.0
 3 NaN 2.0 10.0

This is the code I am using now. It works, but take forever.

for x in range(0,len(dfcap)/63 - 1):
    lst = list()
    for value in dfcap.iloc[x*63].nlargest(500):
         lst.append((dfcap == value).idxmax(axis=1)[x*63])
    for column in dfcap.columns:
         for n in range(x*63,x*63 + 63):
             if column not in lst: dfcap[column][n] = 0

Solution

  • If I understand your question this should be much faster for you.
    This is my %%timeit output on 630k rows x 1000 columns in a VM running on an Intel i5.

    %%timeit -n 2 -r 2
    19.3 s ± 549 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)

    import pandas as pd
    import numpy as np
    import random, string
    
    def randomticker(length):
    """ Generate random uppercase string of length given """
       letters = string.ascii_uppercase
       return ''.join(random.choice(letters) for i in range(length))
    
    # generate random data, 630k rows (dates) by 1000 columns (companies)
    data = np.random.rand(63 * 10000,1000)
    # generate 1000 random uppercase strings (stock tickers)
    companies = [randomticker(4) for x in range(1000)]
    
    df = pd.DataFrame(data, columns=companies)
    
    # Number of columns to make NA, in your case (width of DF - 500)
    mask_na_count = len(df.columns) - 500 
    # If your index is not sorted 0-n integers use this line
    # df = df.reset_index(drop=True)  
    
    for x in range(0,len(df)//63 - 1):
        # Get the smallest (width-500)  valued column names at x*63 index
        na_cols = df.iloc[x*63].nsmallest(mask_na_count).index
        # Grab chunk of 63 rows and make smallest columns np.nan
        df.loc[(x-1)*63:x*63, na_cols] = np.nan
    

    If you need your indexes as dates again you can just save off index before reset and then apply the index again save_index = df.index and df.index = save_index