Search code examples
pythonpandasindexingstatisticsrow

Selecting a row of pandas dataframe that is below threshold from another dataframe/series


I am attempting to run sequential analysis on each point within a time-series of length 100 for 44 pedal revolutions (cycles).

To do this I am trying to return the index value of the row within a mean deviation DataFrame (44, 100) when the value of the row falls below a given 0.25 * SD threshold (stored as (44,) Series). I would like this code to compare each column of the mean_dev DataFrame with each corresponding value of sd_quarter Series.

Here is the code:

# transpose (100, 44) df (cycles) to (44, 100) to allow expanding.mean
transpose_cycles = cycles.transpose()
# create empty dataframes to store outputs
cumulative_ave = pd.DataFrame()
sd_quarter = pd.DataFrame()
cum_sd = pd.DataFrame()
mean_dev = pd.DataFrame()
# cumulative average for whole time-series of each pedal revolution
cumulative_ave = transpose_cycles.expanding().mean()
# transpose mean_cycle (100, 1) to (1, 100) to allow calculation with cumulative_ave
mean_transpose = mean_cycle.transpose()
# calculate mean deviation (average cycle - cumulative average)
mean_dev = mean_cycle - cumulative_ave
# standard deviation for each pedal revolution
cum_sd = transpose_cycles.std(axis=1)
# SD*0.25
sd_quarter = cum_sd*0.25

I now need to compare each column of mean_dev to each sd_quarter value and return the row index of when the row < sd_quarter. Ideally this would be outputted as a list of 44 numbers.

print(mean_dev.info()) <class 'pandas.core.frame.DataFrame'> RangeIndex: 44 entries, 0 to 43 Data columns (total 100 columns): dtypes: float64(100)

print(sd_quarter.info()) <class 'pandas.core.series.Series'> RangeIndex: 44 entries, 0 to 43 Series name: None Non-Null Count Dtype


44 non-null float64 dtypes: float64(1)


Solution

  • Could you try this:

    # compare each row of mean_dev with sd_quarter and return row index where row < sd_quarter
    result = mean_dev.apply(lambda row: row < sd_quarter[row.name], axis=1)
    
    # get index of first True value for each column
    index_list = result.idxmax()
    
    # convert index_list to a list
    index_list = index_list.tolist()
    

    Keep in mind that you'll need to have defined the cycles and mean_cycle DataFrames before running this code.