Search code examples
pythonpandasdatabasedataframestock

Filter dataframe for conditions based on results from second filtered dataframe: Seeking efficiency boost


Seeking a more efficient solution to my clunky broken one

I have a very large dataframe of all minutely stock data for many symbols. It looks like this df:

        symbol        time   open   high    low  close  volume
0         AEHR  1628656380    652    652    652    652     200
1         AEHR  1628660580    646    646    646    646     100
2         AEHR  1628668380    634    634    634    634     300
3         AEHR  1628668920    606    608    606    608     402
4         AEHR  1628669100    615    615    615    615     100
...        ...         ...    ...    ...    ...    ...     ...
4266849   UPST  1631303160  26753  26753  26753  26753     163
4266850   UPST  1631303340  26805  26805  26805  26805     231
4266851   UPST  1631303520  26768  26768  26768  26768     226
4266852   UPST  1631303760  26819  26819  26819  26819    1964
4266853   UPST  1631303940  26899  26899  26899  26899     157

I would like to slice for efficient further processing. I only want to keep data relating to this list of "trades taken" database db_buy:

  Symbol                Time
0    AMD 2019-12-12 09:36:00
1    AMD 2020-01-16 09:33:00
2   BITF 2021-08-03 09:47:00
3   DOCN 2021-06-14 09:32:00
4   NVDA 2020-07-29 09:38:00
5   NVDA 2020-09-25 10:34:00
6   UPST 2021-02-09 09:32:00
7   UPST 2021-03-18 09:32:00

I only want to keep the data where

  • df['symbol'] = df_buy['Symbol']; and,
  • df['time'] > df_buy['Time'] - BDay(25)

My solutions feels like it could be done in a few lines:

    db_buy.sort_index(inplace=True)
    no_match = np.nan

    # Find start date required for slice
    tickers = pd.DataFrame()
    tickers['Symbol'] = db_buy['Symbol'].unique()
    tickers.sort_index(inplace=True)
    print(tickers)
    for i, row in tickers.iterrows():
        cond = (db_buy['Symbol'] == tickers['Symbol'])
        same_symbol = db_buy[cond]  # gets df of same tickers ['Symbol', 'Time']
        min_date = same_symbol['Time'].min()  # returns min value
        match = no_match if not cond.any() else min_date
        # ^ Returns 'NAN' if cond=false, else match=first row of df which is true
        tickers.loc[i, 'Time'] = match

    # Get data for stocks in buy list only. Time data starting 25 days prior to buy date to now
    cond = (
            (df['Symbol'] == tickers['Symbol']) &
            (df['Time'] > tickers['Time'] + BDay(-25))
    )
    df = df[cond]

cond = (db_buy['Symbol'] == tickers['Symbol']) ValueError: Can only compare identically-labeled Series objects


Solution

  • You can use a simple slicing on combined two conditions:

    # ensure datetime type
    df['time'] = pd.to_datetime(df['time'], unit='s')
    d_buy['Time'] = pd.to_datetime(d_buy['Time'])
    
    # slice data
    df[  df['symbol'].isin(d_buy['Symbol'].unique())            # symbol is in Symbol
       & df['time'].gt(d_buy['Time'].min()-pd.Timedelta('25d')) # time > min(Time)-25days
      ]
    

    output:

            symbol                time   open   high    low  close  volume
    4266849   UPST 2021-09-10 19:46:00  26753  26753  26753  26753     163
    4266850   UPST 2021-09-10 19:49:00  26805  26805  26805  26805     231
    4266851   UPST 2021-09-10 19:52:00  26768  26768  26768  26768     226
    4266852   UPST 2021-09-10 19:56:00  26819  26819  26819  26819    1964
    4266853   UPST 2021-09-10 19:59:00  26899  26899  26899  26899     157