I am working on the common starter competition on Kaggle, and realised that adding Age to the classifier helps. The problem is that the Age column contains NaN values, and I don't want to fill NaNs across the whole df, just in the Age column. I apply the solution below: compute the median, then fill the column, for example X_train['Age'] = X_train['Age'].fillna(X_train_median).
I know this is not good practice. It works, but I get the following warning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Is there a better way to update a specific column for all rows matching a certain criterion in a df? Example code below.
import pandas as pd
from sklearn.model_selection import train_test_split
# IMPORT DATA
train_data = pd.read_csv("data/train.csv")
test_data = pd.read_csv("data/test.csv")
# ASSIGN TO VAR
X_test = test_data
X = train_data
y = train_data["Survived"]
# SPLIT
X_train, X_val, Y_train, Y_val = train_test_split(X, y, random_state=1)
# SELECTED FEATURES
features = ["Pclass", "Sex", "SibSp", "Parch", "Embarked", "Age"]
# REMOVE NA's BY POPULATING WITH MEDIAN VAL
X_train_median = X_train['Age'].median()
X_val_median = X_val['Age'].median()
X_test_median = X_test['Age'].median()
X_train['Age'] = X_train['Age'].fillna(X_train_median)
X_val['Age'] = X_val['Age'].fillna(X_val_median)
X_test['Age'] = X_test['Age'].fillna(X_test_median)
# ONE HOT FOR CATEGORICAL VALS
X_train = pd.get_dummies(X_train[features])
X_val = pd.get_dummies(X_val[features])
X_test = pd.get_dummies(X_test[features])
I believe this should work:
X_train['Age'] = X_train.loc[:, 'Age'].fillna(X_train_median)
X_val['Age'] = X_val.loc[:, 'Age'].fillna(X_val_median)
X_test['Age'] = X_test.loc[:, 'Age'].fillna(X_test_median)
Docs: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html
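If the warning still appears, one approach that is often suggested is to take an explicit copy of each split before modifying it, and then update only the matching rows with a single .loc call. The sketch below is a minimal illustration under some assumptions: the small DataFrame is made-up stand-in data rather than the Kaggle file, and reusing the training median for the validation split is my own choice here, not something taken from the question.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for train.csv (not the real Kaggle data)
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0, 0, 1],
    "Pclass": [3, 1, 3, 1, 3, 2, 1, 3],
    "Age": [22.0, 38.0, np.nan, 35.0, np.nan, 54.0, 2.0, 27.0],
})

X_train, X_val, y_train, y_val = train_test_split(df, df["Survived"], random_state=1)

# Explicit copies, so later assignments are not made on a slice of df
X_train = X_train.copy()
X_val = X_val.copy()

# Median computed on the training split only (an assumption of this sketch)
train_median = X_train["Age"].median()

# Update only the rows where Age is missing, in one .loc call per frame
X_train.loc[X_train["Age"].isna(), "Age"] = train_median
X_val.loc[X_val["Age"].isna(), "Age"] = train_median
```

The same pattern, df.loc[mask, "column"] = value, works for any boolean condition, not just missing values. The original df is left unchanged because the splits are independent copies.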