I need help with the specific code I will paste below. I am using the Ames Housing data set collected by Dean De Cock.
I am using a Python notebook and editing thru Anaconda's Jupyter Lab 2.1.5.
The code below is supposed to replace all np.nan
or "None" values. For some reason,
after repeatedly calling a hand-made function inside a for loop, the columns of the resulting data frame get swapped around.
Note: I am aware I could do this with an "imputer." I plan to select numeric and object type features, impute them separately then put them back together. As a side-note, is there any way I can do that while having the details I output manually using text displayed or otherwise verified?
In the cell in question, the flow is:
df_train
.df_train
. The syntax is sr_null_counts = df_train.isnull().sum()
.df_train
in-place
used to be within the for-loop.df_train
.I used a function that has a for-loop and nested try-except statements to:
NaN
or None
) values with the mean of the passed series.
(b) If that fails, try to replace all null values with the median of the series.
(c) If even that fails, replace all null values with the mode of the series.The final line is to drop all the columns marked as having more than 5% missing values.
Here is the full code:
The full data-set was loaded thru df_housing = pd.read_csv(sep = '\t', filepath_or_buffer = "AmesHousing.tsv")
.
def make_traintest(df, train_fraction = 0.7, random_state_val = 88):
df = df.copy()
df_train = df.sample(frac = train_fraction, random_state = random_state_val)
bmask_istrain = df.index.isin(df_train.index.values)
df_test = df.loc[ ~bmask_istrain ]
return {
"train":df_train,
"test":df_test
}
dict_traintest = make_traintest(df = df_housing)
df_train = dict_traintest["train"]
df_test = dict_traintest["test"]
lst_have_nulls = []
for feature in df_housing.columns.values.tolist():
nullcount = df_housing[feature].isnull().sum()
if nullcount > 0:
lst_have_nulls.append(feature)
print(feature, "\n=====\nNull Count:\t", nullcount, '\n', df_housing[feature].value_counts(dropna = False),'\n*****')
def impute_series(sr_values, feature_name = ''):
sr_out = sr_values.copy()
try:
sr_out.fillna(value = sr_values.mean())
print("Feature", feature_name, "imputed with mean:", sr_values.mean())
except Exception as e:
print("Filling NaN values with mean of feature", feature_name, "caused an error:\n", e)
try:
sr_out.fillna(value = sr_values.median())
print("Feature", feature_name, "imputed with median:", sr_values.median())
except Exception as e:
print("Filling NaN values with median for feature", feature_name, "caused an error:\n", e)
sr_out.fillna(value = sr_values.mode())
print("Feature", feature_name, "imputed with mode:", sr_values.mode())
return sr_out
Getting the count of null values, defining the empty list of columns to drop to allow appending, and repeatedly
doing the following: For every column in lst_have_nulls
, check if the column has equal, less or more than 5% missing values.
If more, append the column to lst_drop
. Else, call the hand-made imputing function. After the for-loop, drop all columns in
lst_drop
, in-place.
Where did I go wrong? In case you need the entire notebook, I have uploaded it to Kaggle. Here is a link. https://www.kaggle.com/joachimrives/ames-housing-public-problem
When I tried the code of Anvar Kurmukov, my dataframe column values still got swapped. The change I made was adding int
and float
to the list of dtypes to check. The changes are inside the for-loop:
if dtype in [np.int64, np.float64, int, float]
.
It may be a problem with another part of my code in the full notebook. I will need to check where it is by calling df_train.info()
cell by cell from the top. I tested the code in the notebook I made public. It is in cell 128. For some reason, after running Anvar's code, the df_train.info()
method returned this:
1st Flr SF 2nd Flr SF 3Ssn Porch Alley Bedroom AbvGr Bldg Type Bsmt Cond Bsmt Exposure Bsmt Full Bath Bsmt Half Bath ... Roof Style SalePrice Screen Porch Street TotRms AbvGrd Total Bsmt SF Utilities Wood Deck SF Year Built Year Remod/Add
1222 1223 534453140 70 RL 50.0 4882 Pave NaN IR1 Bnk ... 0 0 0 0 0 NaN NaN NaN 0 87000
1642 1643 527256040 20 RL 81.0 13870 Pave NaN IR1 HLS ... 52 0 0 174 0 NaN NaN NaN 0 455000
1408 1409 905427050 50 RL 66.0 21780 Pave NaN Reg Lvl ... 36 0 0 144 0 NaN NaN NaN 0 185000
1729 1730 528218050 60 RL 65.0 10237 Pave NaN Reg Lvl ... 72 0 0 0 0 NaN NaN NaN 0 178900
1069 1070 528180110 120 RL 58.0 10110 Pave NaN IR1 Lvl ... 48 0 0 0 0 NaN NaN NaN 0 336860
I discovered the answer as to why my columns were being swapped. They were not actually being swapped. The original problem was that I had not set the "Order" column as the index column. To fix the problem on the notebook in my PC, I simply added the following paramater and value to pd.read_csv
: index_col = "Order"
. That fixed the problem on my local notebook. When I tried it on the Kaggle notebook, however, it did not fix the problem
The version of the Ames Housing data set I first used on the notebook - for some reason - was also the cause for the column swapping.
Anvar's Code is fine. You may test the code I wrote, but to be safe, defer to Anvar's code. Mine is still to be tested.
I modified the Kaggle notebook I linked in my question. I used the data set I was actually working in with my PC. When I did that, the code given by Anvar Kurmukov's answer worked perfectly. I tested my own code and it seems fine, but test both versions before trying. I only reviewed the data sets using head()
and manually checked the column inputs. If you want to check the notebook, here it is:
https://www.kaggle.com/joachimrives/ames-housing-public-problem/
To test if the data set was at fault, I created to data frames. One was taken directly from my local file uploaded to Kaggle. The other used the current version of the Ames Iowa Housing data set I had used as input. The columns were properly "aligned" with their expected input. To find the expected column values, I used this source: http://jse.amstat.org/v19n3/decock/DataDocumentation.txt
Here are the screenshots of the different results I got when I swapped data sets: