Search code examples
pythonlistnumpyperformancediff

How to drop (delete) consecutive values in a Dataframe


I have a dataframe with a column that has 0 Values. I wish to find those 0 values and check if till the end they are 0, drop only those at the end and not in the middle.

this is how the Data in secondary_df looks like:

      DSCD        date    year    month  RI    RIu   RIu1    RIe  
203  1316    1/29/2010  2010.0    1.0  66.39  66.30  6.21    6.39  \
275  1316    1/29/2016  2016.0    1.0  66.97  166.84  6.89   6.32   
131  1316    1/30/2004  2004.0    1.0  66.01  66.15   6.36   6.60   
191  1316    1/30/2009  2009.0    1.0  66.36  6.54  685.25   6.71   
263  1316    1/30/2015  2015.0    1.0  66.43  6.94  114.14   6.33   
..    ...       ...      ...      ...    ...   ...     ...     ...   
250  1316   12/31/2013  2013.0   12.0  99.98   5.24    59.91   5.07   
262  1316   12/31/2014  2014.0   12.0  99.33   54.14   54.64   55.96   
274  1316   12/31/2015  2015.0   12.0  55.32   5.89    15.19  54.34   
310  1316   12/31/2018  2018.0   12.0  55.56   55.23   5.40   5.49   
322  1316   12/31/2019  2019.0   12.0  55.39   55.98   5.69   5.88  



        RIu Pct Return  RIe_Pct_Return   
203        -0.05        0.0255  \
275        -0.0358      -0.059   
131         0.058        0.05106   
191         0.0055       0.0535   
263        -0.035        0.053   
..               ...             ...   
250         0.01092       -0.05   
262        -0.001        0.02572   
274        -0.003       -0.0512   
310        -0.000       -0.05274   
322         0.004        0.039   

This is what I got so far.

for DSCD in FirmReturnIndexValues['DSCD'].unique():
    secondary_df=FirmReturnIndexValues[FirmReturnIndexValues['DSCD']==DSCD]
    t=secondary_df[(secondary_df['RIe Pct Return'].values == 0)].index.values.tolist()
    t.sort()
    if len(t)>=1:
        print(np.diff((t)))

for example this part is t:

[69438, 69439, 69440, 69441, 69442, 69443, 69444, 69445, 69446, 69447, 69448, 69449, 69450, 69451, 69452, 69453, 69454, 69455, 69456, 69457, 69458, 69459, 69460, 69461, 69462, 69463, 69464, 69465, 69466, 69468, 69548, 69570, 69571, 69572, 69573, 69574, 69575, 69576, 69577, 69578, 69579, 69580, 69581, 69582, 69583, 69584, 69585, 69586, 69587, 69588, 69589, 69590, 69591, 69592, 69593, 69594, 69595, 69596, 69597, 69598, 69599, 69600, 69601, 69602, 69603, 69604, 69605, 69606, 69607, 69608, 69609, 69610, 69611, 69612, 69613, 69614, 69615, 69616, 69617, 69618, 69619, 69620, 69621, 69622, 69623, 69624, 69625, 69626, 69627, 69628, 69629, 69630, 69631, 69632, 69633, 69634, 69635, 69636, 69637, 69638, 69639, 69640, 69641, 69642, 69643, 69644, 69645, 69646, 69647, 69648, 69649, 69650, 69651, 69652, 69653, 69654, 69655, 69656, 69657, 69658]

this is the Indexes that I get from my code and when I use the np.diff() method I get this values and the values I wish to drop (delete) are being bolded:

[ 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 80 22 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]

so I have 2 Questions.

  1. how can I delete the bolded one's?
  2. the first for loop contains 8000 DSCD's is there anyway it can be more efficient?

Another Example: list t:

[74536, 74537, 74538, 74540, 74542, 74543, 74544, 74545, 74547, 74551, 74554, 74555, 74559, 74560, 74561, 74562, 74563, 74566, 74567, 74568, 74569, 74571, 74572, 74573, 74574, 74575, 74578, 74579, 74580, 74582, 74584, 74585, 74586, 74587, 74588, 74589, 74590, 74591, 74592, 74595, 74596, 74597, 74598, 74599, 74601, 74602, 74603, 74604, 74605, 74606, 74607, 74608, 74609, 74610, 74612, 74613, 74614, 74615, 74616, 74617, 74618, 74619, 74620, 74621, 74622, 74623, 74624, 74625, 74626, 74627, 74628, 74629, 74630, 74631, 74632, 74633, 74634, 74635, 74636, 74637, 74638, 74639, 74640, 74641, 74642, 74643, 74644, 74645, 74646, 74647, 74648, 74649, 74650, 74651, 74652, 74653, 74654, 74655, 74656, 74657, 74658, 74659, 74660, 74661, 74662, 74663, 74664, 74665, 74666, 74667, 74668, 74669, 74670, 74671, 74672, 74673, 74674, 74675, 74676, 74677, 74678, 74679, 74680, 74681, 74682, 74683, 74684, 74685, 74686, 74687, 74688, 74689, 74690, 74691, 74692, 74693, 74694, 74695, 74696, 74697, 74698, 74699, 74700, 74701, 74702, 74703, 74704, 74705, 74706, 74707, 74708, 74709, 74710, 74711, 74712, 74713, 74714, 74715, 74716, 74717, 74718, 74719, 74720, 74721, 74722, 74723, 74724, 74725, 74726, 74727, 74728, 74729, 74730, 74731, 74732, 74733, 74734, 74735, 74736, 74737, 74738, 74739, 74740, 74741, 74742, 74743, 74744, 74745, 74746, 74747, 74748, 74749, 74750, 74751, 74752, 74753, 74754, 74755, 74756, 74757, 74758, 74759, 74760, 74761, 74762, 74763, 74764, 74765, 74766, 74767, 74768, 74769, 74770, 74771, 74772, 74773, 74774, 74775, 74776, 74777, 74778, 74779, 74780, 74781, 74782, 74783, 74784, 74785, 74786, 74787, 74788, 74789, 74790, 74791, 74792, 74793, 74794, 74795, 74796, 74797, 74798, 74799, 74800, 74801, 74802, 74803, 74804, 74805, 74806, 74807, 74808, 74809, 74810, 74811, 74812, 74813, 74814, 74815, 74816, 74817, 74818, 74819, 74820, 74821, 74822, 74823, 74824, 74825, 74826, 74827, 74828, 74829, 74830, 74831, 74832, 74833, 74834, 74835, 74836, 74837, 74838, 74839, 74840, 74841, 74842]

result of np.diff():

[1 1 2 2 1 1 1 2 4 3 1 4 1 1 1 1 3 1 1 1 2 1 1 1 1 3 1 1 2 2 1 1 1 1 1 1 1 1 3 1 1 1 1 2 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]

Hard one:

[6359, 6431, 6287, 6347, 6419, 6263, 6275, 6299, 6311, 6323, 6335, 6371, 6383, 6395, 6407, 6443, 6455, 6467, 6360, 6288, 6348, 6420, 6252, 6264, 6276, 6300, 6312, 6324, 6372, 6396, 6408, 6444, 6456, 6468, 6336, 6384, 6432, 6265, 6397, 6469, 6253, 6325, 6385, 6457, 6277, 6289, 6301, 6313, 6337, 6349, 6361, 6373, 6409, 6421, 6433, 6445, 6314, 6446, 6302, 6374, 6434, 6254, 6266, 6278, 6290, 6326, 6338, 6350, 6362, 6386, 6398, 6410, 6422, 6458, 6470, 6423, 6279, 6339, 6411, 6255, 6267, 6291, 6303, 6315, 6327, 6363, 6375, 6387, 6399, 6435, 6447, 6459, 6471, 6268, 6400, 6472, 6256, 6328, 6388, 6460, 6280, 6292, 6304, 6316, 6340, 6351, 6364, 6376, 6412, 6424, 6436, 6448, 6305, 6377, 6437, 6293, 6365, 6257, 6269, 6281, 6317, 6329, 6341, 6389, 6401, 6413, 6425, 6449, 6461, 6473, 6282, 6342, 6414, 6270, 6402, 6474, 6258, 6294, 6306, 6318, 6330, 6353, 6366, 6378, 6390, 6426, 6438, 6450, 6462, 6259, 6331, 6391, 6463, 6319, 6451, 6271, 6283, 6295, 6307, 6343, 6355, 6367, 6379, 6403, 6415, 6427, 6439, 6475, 6296, 6368, 6356, 6428, 6260, 6272, 6284, 6308, 6320, 6332, 6344, 6380, 6392, 6404, 6416, 6440, 6452, 6464, 6476, 6285, 6345, 6417, 6273, 6405, 6477, 6261, 6297, 6309, 6321, 6333, 6357, 6369, 6381, 6393, 6429, 6441, 6453, 6465, 6322, 6454, 6310, 6382, 6442, 6262, 6274, 6286, 6298, 6334, 6346, 6358, 6370, 6394, 6406, 6418, 6430, 6466, 6478]

after sorting the previous list you'll get:

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]

as the same logic, the bold ones need to be removed


Solution

  • I think this is what you are looking for (Full example)

    Pandas Version:

    s = pd.Series([69438, 69439, 69440, 69441, 69442, 69443, 69444, 69445, 69446, 69447, 69448, 69449, 69450, 69451, 69452, 69453, 69454, 69455, 69456, 69457, 69458, 69459, 69460, 69461, 69462, 69463, 69464, 69465, 69466, 69468, 69548, 69570, 69571, 69572, 69573, 69574, 69575, 69576, 69577, 69578, 69579, 69580, 69581, 69582, 69583, 69584, 69585, 69586, 69587, 69588, 69589, 69590, 69591, 69592, 69593, 69594, 69595, 69596, 69597, 69598, 69599, 69600, 69601, 69602, 69603, 69604, 69605, 69606, 69607, 69608, 69609, 69610, 69611, 69612, 69613, 69614, 69615, 69616, 69617, 69618, 69619, 69620, 69621, 69622, 69623, 69624, 69625, 69626, 69627, 69628, 69629, 69630, 69631, 69632, 69633, 69634, 69635, 69636, 69637, 69638, 69639, 69640, 69641, 69642, 69643, 69644, 69645, 69646, 69647, 69648, 69649, 69650, 69651, 69652, 69653, 69654, 69655, 69656, 69657, 69658])
    
    s[:s.index[s.diff().ne(1)][-1]+1]
    
    returns:
    [69438, 69439, 69440, 69441, 69442, 69443, 69444, 69445, 69446, 69447, 69448, 69449, 69450, 69451, 69452, 69453, 69454, 69455, 69456, 69457, 69458, 69459, 69460, 69461, 69462, 69463, 69464, 69465, 69466, 69468, 69548, 69570, 69571, 69572]
    

    Logic explanation. First diff gets to where you were.

    ne(1) - only get locations where diff is not equal to 1 and then we return the indices for those points.

    [-1] get the last index add 2 and then slice the original array.

    Numpy Version:

    # stay in pandas series/df and numpy arrays as much as possible avoid lists
    for DSCD in FirmReturnIndexValues['DSCD'].unique():
        secondary_df=FirmReturnIndexValues[FirmReturnIndexValues['DSCD'].eq(DSCD)]
        t=secondary_df[secondary_df['RIe Pct Return'].eq(0)].index.values
        t.sort()
        if len(t)>=1:
            
            # get last index where diff is not 1
            ind = np.where(np.not_equal(np.diff(t), 1))[0][-1]
            
            # add 2 to index to remove the bolded numbers per your example
            result = t[:ind+2]
    

    Answer Continued:

    In general here are some simple speed ups for the code you have shown. But it is unclear what exactly you have as an input df and desire as and result. If you can provide that we can benchmark it.

    # use direct numpy method calls (eq,lt,gt,ge,le,ne) instead of ==, <, >, etc..
    secondary_df=FirmReturnIndexValues[FirmReturnIndexValues['DSCD'].eq(DSCD)]
            t=secondary_df[secondary_df['RIe Pct Return'].eq(0)].index.values  # don't convert to list if you do not need to
            t.sort()  # use numpy array sort
            if len(t)>=1:
                print(np.diff((t)))
    
    

    These are quick wins. You likely have structural speed ups from a refactor. But you need to provide a full working example and expected output.