I have a dataframe with a column that has 0 Values. I wish to find those 0 values and check if till the end they are 0, drop only those at the end and not in the middle.
this is how the Data in secondary_df
looks like:
DSCD date year month RI RIu RIu1 RIe
203 1316 1/29/2010 2010.0 1.0 66.39 66.30 6.21 6.39 \
275 1316 1/29/2016 2016.0 1.0 66.97 166.84 6.89 6.32
131 1316 1/30/2004 2004.0 1.0 66.01 66.15 6.36 6.60
191 1316 1/30/2009 2009.0 1.0 66.36 6.54 685.25 6.71
263 1316 1/30/2015 2015.0 1.0 66.43 6.94 114.14 6.33
.. ... ... ... ... ... ... ... ...
250 1316 12/31/2013 2013.0 12.0 99.98 5.24 59.91 5.07
262 1316 12/31/2014 2014.0 12.0 99.33 54.14 54.64 55.96
274 1316 12/31/2015 2015.0 12.0 55.32 5.89 15.19 54.34
310 1316 12/31/2018 2018.0 12.0 55.56 55.23 5.40 5.49
322 1316 12/31/2019 2019.0 12.0 55.39 55.98 5.69 5.88
RIu Pct Return RIe_Pct_Return
203 -0.05 0.0255 \
275 -0.0358 -0.059
131 0.058 0.05106
191 0.0055 0.0535
263 -0.035 0.053
.. ... ...
250 0.01092 -0.05
262 -0.001 0.02572
274 -0.003 -0.0512
310 -0.000 -0.05274
322 0.004 0.039
This is what I got so far.
for DSCD in FirmReturnIndexValues['DSCD'].unique():
secondary_df=FirmReturnIndexValues[FirmReturnIndexValues['DSCD']==DSCD]
t=secondary_df[(secondary_df['RIe Pct Return'].values == 0)].index.values.tolist()
t.sort()
if len(t)>=1:
print(np.diff((t)))
for example this part is t:
[69438, 69439, 69440, 69441, 69442, 69443, 69444, 69445, 69446, 69447, 69448, 69449, 69450, 69451, 69452, 69453, 69454, 69455, 69456, 69457, 69458, 69459, 69460, 69461, 69462, 69463, 69464, 69465, 69466, 69468, 69548, 69570, 69571, 69572, 69573, 69574, 69575, 69576, 69577, 69578, 69579, 69580, 69581, 69582, 69583, 69584, 69585, 69586, 69587, 69588, 69589, 69590, 69591, 69592, 69593, 69594, 69595, 69596, 69597, 69598, 69599, 69600, 69601, 69602, 69603, 69604, 69605, 69606, 69607, 69608, 69609, 69610, 69611, 69612, 69613, 69614, 69615, 69616, 69617, 69618, 69619, 69620, 69621, 69622, 69623, 69624, 69625, 69626, 69627, 69628, 69629, 69630, 69631, 69632, 69633, 69634, 69635, 69636, 69637, 69638, 69639, 69640, 69641, 69642, 69643, 69644, 69645, 69646, 69647, 69648, 69649, 69650, 69651, 69652, 69653, 69654, 69655, 69656, 69657, 69658]
this is the Indexes that I get from my code and when I use the np.diff()
method I get this values and the values I wish to drop (delete) are being bolded:
[ 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 80 22 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
so I have 2 Questions.
Another Example: list t:
[74536, 74537, 74538, 74540, 74542, 74543, 74544, 74545, 74547, 74551, 74554, 74555, 74559, 74560, 74561, 74562, 74563, 74566, 74567, 74568, 74569, 74571, 74572, 74573, 74574, 74575, 74578, 74579, 74580, 74582, 74584, 74585, 74586, 74587, 74588, 74589, 74590, 74591, 74592, 74595, 74596, 74597, 74598, 74599, 74601, 74602, 74603, 74604, 74605, 74606, 74607, 74608, 74609, 74610, 74612, 74613, 74614, 74615, 74616, 74617, 74618, 74619, 74620, 74621, 74622, 74623, 74624, 74625, 74626, 74627, 74628, 74629, 74630, 74631, 74632, 74633, 74634, 74635, 74636, 74637, 74638, 74639, 74640, 74641, 74642, 74643, 74644, 74645, 74646, 74647, 74648, 74649, 74650, 74651, 74652, 74653, 74654, 74655, 74656, 74657, 74658, 74659, 74660, 74661, 74662, 74663, 74664, 74665, 74666, 74667, 74668, 74669, 74670, 74671, 74672, 74673, 74674, 74675, 74676, 74677, 74678, 74679, 74680, 74681, 74682, 74683, 74684, 74685, 74686, 74687, 74688, 74689, 74690, 74691, 74692, 74693, 74694, 74695, 74696, 74697, 74698, 74699, 74700, 74701, 74702, 74703, 74704, 74705, 74706, 74707, 74708, 74709, 74710, 74711, 74712, 74713, 74714, 74715, 74716, 74717, 74718, 74719, 74720, 74721, 74722, 74723, 74724, 74725, 74726, 74727, 74728, 74729, 74730, 74731, 74732, 74733, 74734, 74735, 74736, 74737, 74738, 74739, 74740, 74741, 74742, 74743, 74744, 74745, 74746, 74747, 74748, 74749, 74750, 74751, 74752, 74753, 74754, 74755, 74756, 74757, 74758, 74759, 74760, 74761, 74762, 74763, 74764, 74765, 74766, 74767, 74768, 74769, 74770, 74771, 74772, 74773, 74774, 74775, 74776, 74777, 74778, 74779, 74780, 74781, 74782, 74783, 74784, 74785, 74786, 74787, 74788, 74789, 74790, 74791, 74792, 74793, 74794, 74795, 74796, 74797, 74798, 74799, 74800, 74801, 74802, 74803, 74804, 74805, 74806, 74807, 74808, 74809, 74810, 74811, 74812, 74813, 74814, 74815, 74816, 74817, 74818, 74819, 74820, 74821, 74822, 74823, 74824, 74825, 74826, 74827, 74828, 74829, 74830, 74831, 74832, 74833, 74834, 74835, 74836, 74837, 74838, 74839, 74840, 74841, 74842]
result of np.diff():
[1 1 2 2 1 1 1 2 4 3 1 4 1 1 1 1 3 1 1 1 2 1 1 1 1 3 1 1 2 2 1 1 1 1 1 1 1 1 3 1 1 1 1 2 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
Hard one:
[6359, 6431, 6287, 6347, 6419, 6263, 6275, 6299, 6311, 6323, 6335, 6371, 6383, 6395, 6407, 6443, 6455, 6467, 6360, 6288, 6348, 6420, 6252, 6264, 6276, 6300, 6312, 6324, 6372, 6396, 6408, 6444, 6456, 6468, 6336, 6384, 6432, 6265, 6397, 6469, 6253, 6325, 6385, 6457, 6277, 6289, 6301, 6313, 6337, 6349, 6361, 6373, 6409, 6421, 6433, 6445, 6314, 6446, 6302, 6374, 6434, 6254, 6266, 6278, 6290, 6326, 6338, 6350, 6362, 6386, 6398, 6410, 6422, 6458, 6470, 6423, 6279, 6339, 6411, 6255, 6267, 6291, 6303, 6315, 6327, 6363, 6375, 6387, 6399, 6435, 6447, 6459, 6471, 6268, 6400, 6472, 6256, 6328, 6388, 6460, 6280, 6292, 6304, 6316, 6340, 6351, 6364, 6376, 6412, 6424, 6436, 6448, 6305, 6377, 6437, 6293, 6365, 6257, 6269, 6281, 6317, 6329, 6341, 6389, 6401, 6413, 6425, 6449, 6461, 6473, 6282, 6342, 6414, 6270, 6402, 6474, 6258, 6294, 6306, 6318, 6330, 6353, 6366, 6378, 6390, 6426, 6438, 6450, 6462, 6259, 6331, 6391, 6463, 6319, 6451, 6271, 6283, 6295, 6307, 6343, 6355, 6367, 6379, 6403, 6415, 6427, 6439, 6475, 6296, 6368, 6356, 6428, 6260, 6272, 6284, 6308, 6320, 6332, 6344, 6380, 6392, 6404, 6416, 6440, 6452, 6464, 6476, 6285, 6345, 6417, 6273, 6405, 6477, 6261, 6297, 6309, 6321, 6333, 6357, 6369, 6381, 6393, 6429, 6441, 6453, 6465, 6322, 6454, 6310, 6382, 6442, 6262, 6274, 6286, 6298, 6334, 6346, 6358, 6370, 6394, 6406, 6418, 6430, 6466, 6478]
after sorting the previous list you'll get:
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
as the same logic, the bold ones need to be removed
I think this is what you are looking for (Full example)
Pandas Version:
s = pd.Series([69438, 69439, 69440, 69441, 69442, 69443, 69444, 69445, 69446, 69447, 69448, 69449, 69450, 69451, 69452, 69453, 69454, 69455, 69456, 69457, 69458, 69459, 69460, 69461, 69462, 69463, 69464, 69465, 69466, 69468, 69548, 69570, 69571, 69572, 69573, 69574, 69575, 69576, 69577, 69578, 69579, 69580, 69581, 69582, 69583, 69584, 69585, 69586, 69587, 69588, 69589, 69590, 69591, 69592, 69593, 69594, 69595, 69596, 69597, 69598, 69599, 69600, 69601, 69602, 69603, 69604, 69605, 69606, 69607, 69608, 69609, 69610, 69611, 69612, 69613, 69614, 69615, 69616, 69617, 69618, 69619, 69620, 69621, 69622, 69623, 69624, 69625, 69626, 69627, 69628, 69629, 69630, 69631, 69632, 69633, 69634, 69635, 69636, 69637, 69638, 69639, 69640, 69641, 69642, 69643, 69644, 69645, 69646, 69647, 69648, 69649, 69650, 69651, 69652, 69653, 69654, 69655, 69656, 69657, 69658])
s[:s.index[s.diff().ne(1)][-1]+1]
returns:
[69438, 69439, 69440, 69441, 69442, 69443, 69444, 69445, 69446, 69447, 69448, 69449, 69450, 69451, 69452, 69453, 69454, 69455, 69456, 69457, 69458, 69459, 69460, 69461, 69462, 69463, 69464, 69465, 69466, 69468, 69548, 69570, 69571, 69572]
Logic explanation. First diff gets to where you were.
ne(1) - only get locations where diff is not equal to 1 and then we return the indices for those points.
[-1] get the last index add 2 and then slice the original array.
Numpy Version:
# stay in pandas series/df and numpy arrays as much as possible avoid lists
for DSCD in FirmReturnIndexValues['DSCD'].unique():
secondary_df=FirmReturnIndexValues[FirmReturnIndexValues['DSCD'].eq(DSCD)]
t=secondary_df[secondary_df['RIe Pct Return'].eq(0)].index.values
t.sort()
if len(t)>=1:
# get last index where diff is not 1
ind = np.where(np.not_equal(np.diff(t), 1))[0][-1]
# add 2 to index to remove the bolded numbers per your example
result = t[:ind+2]
Answer Continued:
In general here are some simple speed ups for the code you have shown. But it is unclear what exactly you have as an input df and desire as and result. If you can provide that we can benchmark it.
# use direct numpy method calls (eq,lt,gt,ge,le,ne) instead of ==, <, >, etc..
secondary_df=FirmReturnIndexValues[FirmReturnIndexValues['DSCD'].eq(DSCD)]
t=secondary_df[secondary_df['RIe Pct Return'].eq(0)].index.values # don't convert to list if you do not need to
t.sort() # use numpy array sort
if len(t)>=1:
print(np.diff((t)))
These are quick wins. You likely have structural speed ups from a refactor. But you need to provide a full working example and expected output.