I have a dataframe with various real estate listings similar to the following.
ADDRESS PRICE ZIP ...
123 Main St 400,000 45678
23 Green Ln 380,000 45670
29 Green Ln 385,000 45670
...
I want to perform a stratified random sample for a testing dataset. In other words, I want to take ~30% of the entries from each ZIP code and separate them into a new dataset. I am not familiar with R dataframes, so how would I perform such an operation?
I've used the sample function like so:
sample(c(1:103), size=31, replace = F)
8 85 5 83 66 46 39 75 101 94 10 68 63 74 22 86 42
59 52 97 62 11 44 96 88 28 9 36 2 78 49
but how do I put these specific rows into a new dataframe?
For stratified sampling you can use the createDataPartition function from the caret package, passing it the variable you want to stratify by (in your case ZIP). The function returns a list, and [[1]] selects its first element, which contains the row indices for the split. With p = 0.7, roughly 70% of the rows go to the training set, which leaves the ~30% you want for testing. You then subset your original dataset by selecting only the rows given by train_index:
train_index <- caret::createDataPartition(your_data$ZIP, p = 0.7)[[1]]
train_data <- your_data[train_index,]
test_data <- your_data[-train_index,]
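If you would rather stay in base R, the same idea can be sketched with split and sample. This also answers the narrower question of how to turn the indices from sample into a new dataframe: put them in the row position of [ , ]. The toy your_data below is made up for illustration, and rounding 30% per ZIP is an assumption about how you want small groups handled. One thing to check with the caret version as well: as far as I know, createDataPartition groups a numeric variable by percentiles rather than by each distinct value, so if ZIP is stored as a number you may want to wrap it in as.factor first.

```r
# Hypothetical toy data standing in for the real listings
set.seed(42)
your_data <- data.frame(
  ADDRESS = paste(seq_len(100), "Example St"),
  PRICE   = round(runif(100, 300000, 500000)),
  ZIP     = rep(c("45670", "45678"), times = c(60, 40)),
  stringsAsFactors = FALSE
)

# Simple (unstratified) case: indices from sample() select rows directly
idx <- sample(seq_len(nrow(your_data)), size = 30, replace = FALSE)
simple_test <- your_data[idx, ]

# Stratified case: group the row numbers by ZIP, then draw ~30% within each ZIP
rows_by_zip <- split(seq_len(nrow(your_data)), your_data$ZIP)
test_index <- unlist(lapply(rows_by_zip, function(rows) {
  n <- round(0.3 * length(rows))        # assumption: round to the nearest row
  rows[sample.int(length(rows), n)]     # sample.int avoids sample()'s length-1 quirk
}), use.names = FALSE)

# Sampled rows become the test set, everything else the training set
test_data  <- your_data[test_index, ]
train_data <- your_data[-test_index, ]

# Check how many rows each ZIP contributed to the test set
table(test_data$ZIP)
```

Note that a ZIP with only one or two listings rounds to zero test rows here, and if no rows at all are sampled, your_data[-test_index, ] returns an empty dataframe rather than the full one, so very small datasets need a guard.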