Tags: function, dataframe, for-loop, scikit-learn, imputation

Columns of Data Frame are Being Swapped: Why is my loop switching the column values when I identify and assign the columns by name?


I need help with the specific code pasted below. I am using the Ames Housing data set compiled by Dean De Cock, working in a Python notebook edited through Anaconda's JupyterLab 2.1.5. The code below is supposed to replace all np.nan or "None" values. For some reason, after repeatedly calling a hand-made function inside a for-loop, the columns of the resulting data frame get swapped around.

Note: I am aware I could do this with an "imputer." I plan to select the numeric and object-type features, impute them separately, then put them back together. As a side note, is there any way to do that while still displaying, or otherwise verifying, the details I currently print out as text?
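For reference, here is a minimal sketch of that imputer-based approach, assuming df_train is already loaded (the column selections and strategies are illustrative, not taken from the original notebook). The fitted imputer's statistics_ attribute also addresses the side note: it records exactly which value each column was filled with.

import pandas as pd
from sklearn.impute import SimpleImputer

num_cols = df_train.select_dtypes(include = "number").columns
obj_cols = df_train.select_dtypes(include = "object").columns

# Impute numeric and object features separately, then recombine.
# (Columns that are entirely NaN would need special handling.)
num_imputer = SimpleImputer(strategy = "mean")
obj_imputer = SimpleImputer(strategy = "most_frequent")

df_num = pd.DataFrame(num_imputer.fit_transform(df_train[num_cols]),
                      columns = num_cols, index = df_train.index)
df_obj = pd.DataFrame(obj_imputer.fit_transform(df_train[obj_cols]),
                      columns = obj_cols, index = df_train.index)

# statistics_ lists the fill value chosen for each numeric column.
print(dict(zip(num_cols, num_imputer.statistics_)))

# Reindex by the original column order so nothing moves around.
df_imputed = pd.concat([df_num, df_obj], axis = 1)[df_train.columns]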

In the cell in question, the flow is:

  1. Get and assign the number of data points in the data frame df_train.
  2. Get and assign a series that lists the count of null values in df_train. The syntax is sr_null_counts = df_train.isnull().sum().
  3. Create an empty list to which the names of features with more than 5% null values are appended. They will be dropped later, outside the for-loop. I thought at first that this was the problem, since the command to drop the columns of df_train in-place used to be inside the for-loop.
  4. Repeatedly call a hand-made function to impute columns whose null counts do not exceed 5% of the row count of df_train.

Inside the for-loop, I call a hand-made function with nested try-except statements to:

  1. Accept a series and, optionally, the series' name from when it was a column in a dataframe. The function assigns a copy of the passed series to a local variable.
  2. In this exact order: (a) try to replace all null (NaN or None) values with the mean of the passed series; (b) if that fails, try to replace all null values with the median of the series; (c) if even that fails, replace all null values with the mode of the series.
  3. Return the edited copy of the series with all null values replaced. The function should also print out strings that tell me which feature was modified and which summary statistic was used to replace/impute the missing values.

The final line is to drop all the columns marked as having more than 5% missing values.

Here is the full code:

Splitting the main dataframe into a train and test set.

The full data set was loaded via df_housing = pd.read_csv(sep = '\t', filepath_or_buffer = "AmesHousing.tsv").

def make_traintest(df, train_fraction = 0.7, random_state_val = 88):
    # Sample a reproducible fraction of the rows as the training set,
    # then take every row whose index is not in that sample as the test set.
    df = df.copy()
    df_train = df.sample(frac = train_fraction, random_state = random_state_val)
    bmask_istrain = df.index.isin(df_train.index.values)
    df_test = df.loc[ ~bmask_istrain ]

    return {
        "train": df_train,
        "test": df_test
        }

dict_traintest = make_traintest(df = df_housing)
df_train = dict_traintest["train"]
df_test = dict_traintest["test"]

Get a List of Columns With Null Values

lst_have_nulls = []
for feature in df_housing.columns.values.tolist():
    nullcount = df_housing[feature].isnull().sum()
    if nullcount > 0:
        lst_have_nulls.append(feature)
        print(feature, "\n=====\nNull Count:\t", nullcount, '\n', df_housing[feature].value_counts(dropna = False),'\n*****')

Definition of the hand-made function:

def impute_series(sr_values, feature_name = ''):
    sr_out = sr_values.copy()
    try:
        # fillna returns a new series, so the result must be assigned back;
        # calling it without assignment (or inplace = True) changes nothing.
        sr_out = sr_out.fillna(value = sr_values.mean())
        print("Feature", feature_name, "imputed with mean:", sr_values.mean())
    except Exception as e:
        print("Filling NaN values with mean of feature", feature_name, "caused an error:\n", e)
        try:
            sr_out = sr_out.fillna(value = sr_values.median())
            print("Feature", feature_name, "imputed with median:", sr_values.median())
        except Exception as e:
            print("Filling NaN values with median for feature", feature_name, "caused an error:\n", e)
            # mode() returns a series, so take its first value to fill with a scalar.
            sr_out = sr_out.fillna(value = sr_values.mode()[0])
            print("Feature", feature_name, "imputed with mode:", sr_values.mode()[0])

    return sr_out

For-Loop

Get the count of null values and define the empty list of columns to drop (so that names can be appended to it), then do the following for every column in lst_have_nulls: check whether the column has more than 5% missing values. If it does, append the column to lst_drop; otherwise, call the hand-made imputing function. After the for-loop, drop all columns in lst_drop in-place.
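The loop itself is not reproduced in the question, but a minimal sketch consistent with that description (variable names follow the question; the 5% threshold is applied to the row count of df_train) would be:

num_rows = df_train.shape[0]
sr_null_counts = df_train.isnull().sum()
lst_drop = []

for feature in lst_have_nulls:
    if sr_null_counts[feature] > 0.05 * num_rows:
        lst_drop.append(feature)
    else:
        df_train[feature] = impute_series(df_train[feature], feature_name = feature)

df_train.drop(columns = lst_drop, inplace = True)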

Where did I go wrong? In case you need the entire notebook, I have uploaded it to Kaggle: https://www.kaggle.com/joachimrives/ames-housing-public-problem

Update: Problem Still Exists After Testing Anvar's Answer with Changes

When I tried Anvar Kurmukov's code, my dataframe column values still got swapped. The change I made was adding int and float to the list of dtypes to check; the change is inside the for-loop: if dtype in [np.int64, np.float64, int, float].
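For context, that dtype check sits inside a loop of roughly the following shape. This is a reconstruction from the quoted condition (assuming numpy is imported as np), not Anvar's actual answer:

for col in df_train.columns:
    dtype = df_train[col].dtype
    if dtype in [np.int64, np.float64, int, float]:
        # Numeric columns: fill missing values with the mean.
        df_train[col] = df_train[col].fillna(df_train[col].mean())
    else:
        # All other columns: fill with the most frequent value.
        df_train[col] = df_train[col].fillna(df_train[col].mode()[0])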

It may be a problem with another part of my code in the full notebook; I will need to find it by calling df_train.info() cell by cell from the top. I tested the code in the notebook I made public, in cell 128. For some reason, after running Anvar's code, previewing df_train returned this:

    1st Flr SF  2nd Flr SF  3Ssn Porch  Alley   Bedroom AbvGr   Bldg Type   Bsmt Cond   Bsmt Exposure   Bsmt Full Bath  Bsmt Half Bath  ... Roof Style  SalePrice   Screen Porch    Street  TotRms AbvGrd   Total Bsmt SF   Utilities   Wood Deck SF    Year Built  Year Remod/Add
1222    1223    534453140   70  RL  50.0    4882    Pave    NaN IR1 Bnk ... 0   0   0   0   0   NaN NaN NaN 0   87000
1642    1643    527256040   20  RL  81.0    13870   Pave    NaN IR1 HLS ... 52  0   0   174 0   NaN NaN NaN 0   455000
1408    1409    905427050   50  RL  66.0    21780   Pave    NaN Reg Lvl ... 36  0   0   144 0   NaN NaN NaN 0   185000
1729    1730    528218050   60  RL  65.0    10237   Pave    NaN Reg Lvl ... 72  0   0   0   0   NaN NaN NaN 0   178900
1069    1070    528180110   120 RL  58.0    10110   Pave    NaN IR1 Lvl ... 48  0   0   0   0   NaN NaN NaN 0   336860

Solution

  • Answer

    I discovered why my columns were being swapped: they were not actually being swapped. The original problem was that I had not set the "Order" column as the index column. To fix the problem in the notebook on my PC, I simply added the following parameter and value to pd.read_csv: index_col = "Order". That fixed the problem in my local notebook. When I tried it on the Kaggle notebook, however, it did not fix the problem.
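    Concretely, the fix amounts to one changed line when loading the data:

    # index_col tells pandas to use the Order column as the row index
    # instead of generating its own RangeIndex.
    df_housing = pd.read_csv(filepath_or_buffer = "AmesHousing.tsv", sep = '\t', index_col = "Order")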

    For some reason, the version of the Ames Housing data set I first used in the notebook was also a cause of the column swapping.

    Anvar's code is fine. You may test the code I wrote, but to be safe, defer to Anvar's code; mine is still to be tested.

    Testing Done

    I modified the Kaggle notebook linked in my question to use the data set I was actually working with on my PC. When I did that, the code from Anvar Kurmukov's answer worked perfectly. I tested my own code and it seems fine, but test both versions before relying on either. I only reviewed the data sets using head() and manually checked the column inputs. If you want to check the notebook, here it is: https://www.kaggle.com/joachimrives/ames-housing-public-problem/

    To test whether the data set was at fault, I created two data frames. One was taken directly from my local file uploaded to Kaggle; the other used the current version of the Ames Iowa Housing data set I had used as input. The columns of the local copy were properly "aligned" with their expected input. To find the expected column values, I used this source: http://jse.amstat.org/v19n3/decock/DataDocumentation.txt
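    A quick way to run that comparison in code (the file names here are assumptions, not the exact names from the notebook):

    df_local = pd.read_csv("AmesHousing.tsv", sep = '\t', index_col = "Order")
    df_kaggle = pd.read_csv("AmesHousing.csv", index_col = "Order")

    # Same column names in the same order?
    print(df_local.columns.equals(df_kaggle.columns))

    # Spot-check the two columns that appeared to trade places.
    print(df_local[["PID", "1st Flr SF"]].head())
    print(df_kaggle[["PID", "1st Flr SF"]].head())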

    Here are the screenshots of the different results I got when I swapped data sets:

    With an uploaded copy of my local file:

    Data frame df_train has the PID column and all columns match expected input.

    With the original AmesHousing.csv from Notebook Version 1:

    The values of PID are assigned to 1st Flr SF and PID is moved elsewhere.

    The data set that caused the column swap on the Kaggle notebook:

    https://www.kaggle.com/marcopale/housing