I have a dataset and want to drop rows that contain negative values in a specific column. I have tried three different ways and all yield the same, very odd outcome.
I will lay out my interaction with the dataframe, show the dataframe and the 'wrong' result that I am getting vs. what I would like to get, and lastly show what I have tried so far. I am using Python 3.
After importing the data with data = pd.read_csv('data.csv'), I drop the columns I don't need with short_data = data.drop(['Col3', 'Col4', 'Col5'], axis=1) and ensure that the remaining columns are floats using .astype(). Then I sort:
sorted_data = short_data.sort_values(['Col2'], axis=0, ascending=[True])
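Put together, the whole setup looks like this (the exact .astype() cast is only sketched here, since I did not copy it above):

import pandas as pd

data = pd.read_csv('data.csv')
# drop the columns I don't need
short_data = data.drop(['Col3', 'Col4', 'Col5'], axis=1)
# make sure the remaining columns are floats (assumed cast, see note above)
short_data = short_data.astype(float)
# sort by Col2
sorted_data = short_data.sort_values(['Col2'], axis=0, ascending=[True])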
At this point I want to drop all rows where Col2 has negative values. However, all three approaches I have tried so far have delivered the same 'wrong' result: instead of dropping the negatives, they drop everything except the rows where Col2 is 0.
# my input dataframe 'sorted_data'
Col1 Col2
249182 0.3 -8.82
132350 0.3 -7.08
56440 -0.2 -4.66
265662 2.8 -1.17
143497 -0.3 0.00
... ... ...
35475 98.5 906.23
35526 98.5 906.23
35338 98.6 906.25
35834 98.5 906.28
35793 98.5 906.39
[286996 rows x 2 columns]
# my wrong output dataframe
Col1 Col2
143497 -0.3 0.0
172367 0.1 0.0
172366 0.2 0.0
172365 0.2 0.0
172364 0.2 0.0
... ... ...
201443 0.2 0.0
201460 0.1 0.0
201445 0.2 0.0
201444 0.2 0.0
201446 0.2 0.0
[137023 rows x 2 columns]
I got to this odd result in three different ways:
# dropping negatives
positive_data = sorted_data.drop(sorted_data.loc[sorted_data["Col2"] < 0.0].index, inplace=False)

# keeping positives
positive_data = sorted_data[sorted_data['Col2'] >= 0.0]

# convert negatives to NaN, then drop NaNs
sorted_data.loc[sorted_data["Col2"] < 0.0, "Col2"] = 'NaN'
positive_data = sorted_data.dropna(subset='Col2')
For that third option, the negative values are replaced correctly, but when I then drop the NaNs it gives the same odd result as the other two.
Is my dataframe broken somehow? What could cause this behaviour?
You might not be calling the correct input and output variables in these tests. Here is why I suggest this:
Options 1 and 2: Your lines work perfectly. Using the few rows at hand and your lines exactly ("dropping negatives" and "keeping positives"), I get the 6 zero-or-positive rows, not a result where "they drop everything except 0". So far, I cannot reproduce your issue.
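For reference, here is that check, rebuilding only the rows visible in your excerpt:

import pandas as pd

# rebuild the visible sample rows, keeping your index values
sorted_data = pd.DataFrame(
    {"Col1": [0.3, 0.3, -0.2, 2.8, -0.3, 98.5, 98.5, 98.6, 98.5, 98.5],
     "Col2": [-8.82, -7.08, -4.66, -1.17, 0.00, 906.23, 906.23, 906.25, 906.28, 906.39]},
    index=[249182, 132350, 56440, 265662, 143497, 35475, 35526, 35338, 35834, 35793],
)

# option 1: dropping negatives
option1 = sorted_data.drop(sorted_data.loc[sorted_data["Col2"] < 0.0].index, inplace=False)
# option 2: keeping positives
option2 = sorted_data[sorted_data['Col2'] >= 0.0]

print(option1.equals(option2))  # True
print(option1)                  # the 6 rows where Col2 >= 0

Both keep exactly the zero and positive rows, as expected.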
Option 3: Now that one should cause a problem. The string "NaN" has nothing to do with an actual Not a Number, so dropna will not recognize it. You should use numpy.nan instead. A correct syntax for it is:
import numpy as np

# convert negatives to NaN, then drop the NaNs
cleared_data = sorted_data.copy()
cleared_data.loc[sorted_data["Col2"] < 0.0, "Col2"] = np.nan
positive_data = cleared_data.dropna(subset='Col2')
This returns the same, correctly filtered dataframe on my side.
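If you want to see the difference between the string "NaN" and a real missing value, here is a minimal check on a made-up two-row frame:

import numpy as np
import pandas as pd

demo = pd.DataFrame({"Col2": [np.nan, "NaN"]})
# dropna removes the real missing value only; the string "NaN" survives
print(demo.dropna(subset=["Col2"]))

Only the row holding the actual np.nan is dropped; the row with the string stays.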
Hence, when you say "it gives the same odd result as the other two", that suggests your root problem does not come from any of these three attempts.