Sampling from an R Dataframe


I have a dataframe with various real estate listings similar to the following.

ADDRESS      PRICE     ZIP     ...
123 Main St  400,000   45678
23 Green Ln  380,000   45670
29 Green Ln  385,000   45670
...

I want to perform a stratified random sample for a testing dataset. In other words, I want to take ~30% of the entries from each ZIP code and separate them into a new dataset. I am not familiar with R dataframes, so how would I perform such an operation?

I've used the sample function like so

sample(c(1:103), size=31, replace = F)

which returns row numbers such as

8  85   5  83  66  46  39  75 101  94  10  68  63  74  22  86  42
59  52  97  62  11  44  96  88  28   9  36   2  78  49

but how do I put these specific rows into a new dataframe?

Solution

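  • For stratified sampling you can use the createDataPartition function from the caret package, passing the variable you want to stratify by (in your case ZIP). With p = 0.7, about 70% of the rows in each stratum go to the training set, which leaves about 30% for testing. createDataPartition returns a list, so [[1]] selects its first element, which holds the row indices for the split. You then subset your original dataset by selecting only the rows given by train_index, and use negative indexing to get the remaining rows. If ZIP is stored as a number, wrap it in factor(); for numeric input createDataPartition stratifies on percentile groups rather than on each distinct value.

    # factor() makes caret sample within each ZIP code
    train_index <- caret::createDataPartition(factor(your_data$ZIP), p = 0.7)[[1]]
    # Keep the sampled rows for training, the remaining ~30% for testing
    train_data <- your_data[train_index, ]
    test_data <- your_data[-train_index, ]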
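
If you would rather not depend on caret, the same idea can be sketched in base R. This is only an illustration under some assumptions: the `listings` data frame and its values are made up to stand in for your data, and the "at least one row per ZIP" rule is a choice made here, not something from the question. It also shows the general answer to "how do I put these rows into a new dataframe": index the data frame with the sampled row numbers.

```r
# Toy data standing in for the real listings (values are invented for illustration)
set.seed(42)
listings <- data.frame(
  ADDRESS = paste(1:20, "Example St"),
  PRICE   = round(runif(20, 300000, 500000)),
  ZIP     = rep(c("45670", "45678"), each = 10),
  stringsAsFactors = FALSE
)

# Group the row numbers by ZIP code
rows_by_zip <- split(seq_len(nrow(listings)), listings$ZIP)

# Draw ~30% of the row numbers within each ZIP (assumption: at least one row per ZIP)
test_index <- unlist(lapply(rows_by_zip, function(idx) {
  n_test <- max(1, round(0.3 * length(idx)))
  # Index via sample.int so a ZIP with a single row is handled correctly
  idx[sample.int(length(idx), n_test)]
}), use.names = FALSE)

# Sampled rows become the test set; negative indexing keeps everything else
test_data  <- listings[test_index, ]
train_data <- listings[-test_index, ]
```

Running `table(test_data$ZIP)` afterwards is a quick way to check how many rows were drawn from each ZIP code.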