I have a dataset and want to drop rows that contain negative values in a specific column. I have tried three different ways and all yield the same, very odd outcome.
I will lay out my interaction with the dataframe, show the dataframe and the 'wrong' result that I am getting vs. what I would like to get, and lastly show what I have tried so far. I am using Python 3.
After importing the data with data = pd.read_csv('data.csv'), I drop the columns I don't need with short_data = data.drop(['Col3', 'Col4', 'Col5'], axis=1) and ensure that the remaining columns are floats using .astype(). Then I sort:
sorted_data = short_data.sort_values(['Col2'], axis=0, ascending=[True])
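Put together, the whole setup looks like this (the exact .astype() cast is only sketched here, since I did not copy it above):

import pandas as pd

data = pd.read_csv('data.csv')
# drop the columns I don't need
short_data = data.drop(['Col3', 'Col4', 'Col5'], axis=1)
# make sure the remaining columns are floats (assumed cast, see note above)
short_data = short_data.astype(float)
# sort by Col2
sorted_data = short_data.sort_values(['Col2'], axis=0, ascending=[True])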
At this point I want to drop all rows where Col2 has negative values. However, all three approaches I have tried so far have delivered the same 'wrong' result: instead of dropping the negatives, they drop everything except the rows where Col2 is 0.
# my input dataframe 'sorted_data'
Col1 Col2
249182 0.3 -8.82
132350 0.3 -7.08
56440 -0.2 -4.66
265662 2.8 -1.17
143497 -0.3 0.00
... ... ...
35475 98.5 906.23
35526 98.5 906.23
35338 98.6 906.25
35834 98.5 906.28
35793 98.5 906.39
[286996 rows x 2 columns]
# my wrong output dataframe
Col1 Col2
143497 -0.3 0.0
172367 0.1 0.0
172366 0.2 0.0
172365 0.2 0.0
172364 0.2 0.0
... ... ...
201443 0.2 0.0
201460 0.1 0.0
201445 0.2 0.0
201444 0.2 0.0
201446 0.2 0.0
[137023 rows x 2 columns]
I got to this odd result in three different ways:
# dropping negatives
positive_data = sorted_data.drop(sorted_data.loc[sorted_data["Col2"] < 0.0].index, inplace=False)

# keeping positives
positive_data = sorted_data[sorted_data['Col2'] >= 0.0]

# convert negatives to NaN, then drop NaNs
sorted_data.loc[sorted_data["Col2"] < 0.0, "Col2"] = 'NaN'
positive_data = sorted_data.dropna(subset='Col2')
For that third option, the negative values are replaced correctly, but when I then drop the NaNs it gives the same odd result as the other two.
Is my dataframe broken somehow? What could cause this behaviour?
You might not be calling the correct input and output variables in these tests. Here is why I suggest this:
Options 1 and 2: Your lines work perfectly. Using the few rows at hand and your lines exactly ("dropping negatives" and "keeping positives"), I get the 6 zero-or-positive rows, not a result where "they drop everything except 0". So far, I cannot reproduce your issue.
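For reference, here is that check, rebuilding only the rows visible in your excerpt:

import pandas as pd

# rebuild the visible sample rows, keeping your index values
sorted_data = pd.DataFrame(
    {"Col1": [0.3, 0.3, -0.2, 2.8, -0.3, 98.5, 98.5, 98.6, 98.5, 98.5],
     "Col2": [-8.82, -7.08, -4.66, -1.17, 0.00, 906.23, 906.23, 906.25, 906.28, 906.39]},
    index=[249182, 132350, 56440, 265662, 143497, 35475, 35526, 35338, 35834, 35793],
)

# option 1: dropping negatives
option1 = sorted_data.drop(sorted_data.loc[sorted_data["Col2"] < 0.0].index, inplace=False)
# option 2: keeping positives
option2 = sorted_data[sorted_data['Col2'] >= 0.0]

print(option1.equals(option2))  # True
print(option1)                  # the 6 rows where Col2 >= 0

Both keep exactly the zero and positive rows, as expected.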
Option 3: Now that one should cause a problem. The string "NaN" has nothing to do with an actual Not a Number, so dropna will not recognize it. You should use numpy.nan instead. A correct syntax for it is:
import numpy as np

# convert negatives to NaN, then drop the NaNs
cleared_data = sorted_data.copy()
cleared_data.loc[sorted_data["Col2"] < 0.0, "Col2"] = np.nan
positive_data = cleared_data.dropna(subset='Col2')
This returns the same, correctly filtered dataframe on my side.
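If you want to see the difference between the string "NaN" and a real missing value, here is a minimal check on a made-up two-row frame:

import numpy as np
import pandas as pd

demo = pd.DataFrame({"Col2": [np.nan, "NaN"]})
# dropna removes the real missing value only; the string "NaN" survives
print(demo.dropna(subset=["Col2"]))

Only the row holding the actual np.nan is dropped; the row with the string stays.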
Hence, when you say "it gives the same odd result as the other two", that suggests your root problem does not come from any of these three attempts.