I'm trying to drop some rows from a Pandas dataframe because they'd be considered outliers in data. I'm getting a KeyError when trying to drop some rows using the method my professor taught me.
gdp_2019_outliers = np.where(df_gdp['2019'] > 6)
df_gdp.drop(gdp_2019_outliers[0], inplace=True)
gdp_2019_outliers_neg = np.where(df_gdp['2019'] < -3)
df_gdp.drop(gdp_2019_outliers_neg[0], inplace=True) # stacktrace points here as the cause
gdp_2020_outliers = np.where(df_gdp['2020'] > 3)
df_gdp.drop(gdp_2020_outliers[0], inplace=True)
gdp_2020_outliers_neg = np.where(df_gdp['2020'] < -15)
df_gdp.drop(gdp_2020_outliers_neg[0], inplace=True)
So, I find the outliers using np.where()
, then pass the list of rows to drop()
. It seems like it's trying to drop rows that are no longer in the dataframe, though -- like the first two lines of code dropped rows that were somehow refound.
Any ideas? Is there a better way to drop rows using a condition?
Stacktrack:
Traceback (most recent call last):
File "C:\Users\colto\Documents\Spring 2022\Data Sciences\Module 5\DataWrangling\data_wrangling_project.py", line 104, in <module>
df_gdp.drop(gdp_2019_outliers_neg[0], inplace=True)
File "C:\Users\colto\Documents\Spring 2022\Data Sciences\Module 5\DataWrangling\venv\lib\site-packages\pandas\util\_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "C:\Users\colto\Documents\Spring 2022\Data Sciences\Module 5\DataWrangling\venv\lib\site-packages\pandas\core\frame.py", line 4956, in drop
return super().drop(
File "C:\Users\colto\Documents\Spring 2022\Data Sciences\Module 5\DataWrangling\venv\lib\site-packages\pandas\core\generic.py", line 4279, in drop
###############################################################################################################
obj = obj._drop_axis(labels, axis, level=level, errors=errors)
File "C:\Users\colto\Documents\Spring 2022\Data Sciences\Module 5\DataWrangling\venv\lib\site-packages\pandas\core\generic.py", line 4323, in _drop_axis
new_axis = axis.drop(labels, errors=errors)
File "C:\Users\colto\Documents\Spring 2022\Data Sciences\Module 5\DataWrangling\venv\lib\site-packages\pandas\core\indexes\base.py", line 6644, in drop
raise KeyError(f"{list(labels[mask])} not found in axis")
KeyError: '[152] not found in axis'
gdp_columns = ['Country Name', '1980', '1990', '2000', '2010', '2018', '2019', '2020']
df_gdp = pd.read_csv(gdp_file, usecols=gdp_columns)
Dataset: https://www.kaggle.com/zackerym/gdp-annual-growth-for-each-country-1960-2020
Let's create the source DataFrame as:
2019 2020
0 5 2
1 6 7
2 7 -15
3 8 8
4 -4 5
5 -3 -18
6 -2 7
7 -5 -3
So far the index contains consecutive integers, starting from 0.
When you compute gdp_2019_outliers, the result is:
(array([2, 3], dtype=int64),)
And after the first drop df_gdp contains:
2019 2020
0 5 2
1 6 7
4 -4 5
5 -3 -18
6 -2 7
7 -5 -3
So far your code succeeded, because integer indices of rows are just the same as in the index of df_gdp.
Then, when you compute gdp_2019_outliers_neg, the result is:
(array([2, 5], dtype=int64),)
Now, when you attempt tu run:
df_gdp.drop(gdp_2019_outliers_neg[0], inplace=True)
an exception is thrown:
KeyError: '[2] not found in axis'
The reason why your code failed is that:
The proper code should be to use boolean indexing:
gdp_2019_outliers = df_gdp['2019'] > 6
df_gdp = df_gdp[~gdp_2019_outliers]
Then, to drop negative outliers for 2019, run:
gdp_2019_outliers_neg = df_gdp['2019'] < -3
df_gdp = df_gdp[~gdp_2019_outliers_neg]
The result, after both drops, is:
2019 2020
0 5 2
1 6 7
5 -3 -18
6 -2 7
Proceed the same way to drop other outliers.