A while back I found a tutorial (which I can no longer locate) that added an extra column to both the training set and the test set, marking with TRUE or FALSE whether a row belongs to the training set. I still have the code, but I could not find where it came from.
titanic.train$IsTrainingSet <- TRUE
titanic.test$IsTrainingSet <- FALSE
Is this good practice or bad practice? I'm curious because I like how easy it makes splitting the data again after performing the data cleaning and manipulation, as below.
titanic.train <- titanic.full[titanic.full$IsTrainingSet == TRUE,]
titanic.test <- titanic.full[titanic.full$IsTrainingSet == FALSE,]
I know there will probably be answers along the lines of "do what you want to do", but I wanted to know whether adding another column to the data is bad practice for any reason.
I will expand on my comment. The tutorial the OP is referring to is here:
https://www.kaggle.com/hiteshp/head-start-for-data-scientist
The author of the tutorial put the two sets together to look at all the data. A word of warning: before doing something like that, you should check that the two sets have the same characteristics (or, as is sometimes said, that they come from the same distribution); otherwise you may end up drawing very wrong conclusions. It would be better to compare the two sets to check whether the test set is representative of the training set. That would be much more helpful.
Sometimes the dev/test sets come from different sources, so be careful when doing something like this, as it may be dangerous.
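As a rough sketch of what such a check could look like, here is one way to compare the two sets using the same flag column. This runs on toy data, not the Kaggle files; the names train, test and full and the columns Age and Sex are stand-ins chosen only for illustration, and the Kolmogorov-Smirnov test is just one of several ways to compare a numeric column.

```r
# Toy stand-ins for titanic.train / titanic.test (hypothetical data)
set.seed(42)
train <- data.frame(Age = rnorm(200, mean = 30, sd = 10),
                    Sex = sample(c("male", "female"), 200, replace = TRUE))
test  <- data.frame(Age = rnorm(100, mean = 30, sd = 10),
                    Sex = sample(c("male", "female"), 100, replace = TRUE))

# Flag each row with its origin, then stack the two sets
train$IsTrainingSet <- TRUE
test$IsTrainingSet  <- FALSE
full <- rbind(train, test)

# Numeric column: summary statistics per set (FALSE = test, TRUE = train)
age.summary <- tapply(full$Age, full$IsTrainingSet, summary)
print(age.summary)

# Two-sample Kolmogorov-Smirnov test; a very small p-value would suggest
# that Age is distributed differently in the two sets
age.ks <- ks.test(train$Age, test$Age)
print(age.ks$p.value)

# Categorical column: proportion of each level within each set
sex.props <- prop.table(table(full$IsTrainingSet, full$Sex), margin = 1)
print(sex.props)
```

If the summaries and proportions look similar across the two sets, combining them for cleaning is less likely to mislead you; if they differ markedly, treat any conclusions drawn from the combined data with caution.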
I hope it helps, Umberto