
When using a vector to take a subset of a dataframe, why are the resultant rows offset?


Please see my code block below. In essence, I am using the Auto dataset from the ISLR library, scaling the dataframe's quantitative predictors, taking a random sample of rows, and then printing the rows that were just sampled. When I run this code, the row labels in the output are offset from the values in the "s" vector by between 1 and 5. What is causing these offsets? I was under the impression that indexing with the "s" vector should return only the rows at the positions given in "s".

Please let me know what you think, thanks!

> summary (Auto)
      mpg          cylinders      displacement     horsepower        weight    
 Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0   Min.   :1613  
 1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0   1st Qu.: 75.0   1st Qu.:2225  
 Median :22.75   Median :4.000   Median :151.0   Median : 93.5   Median :2804  
 Mean   :23.45   Mean   :5.472   Mean   :194.4   Mean   :104.5   Mean   :2978  
 3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8   3rd Qu.:126.0   3rd Qu.:3615  
 Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0   Max.   :5140  

  acceleration        year           origin                      name    
 Min.   : 8.00   Min.   :70.00   Min.   :1.000   amc matador       :  5  
 1st Qu.:13.78   1st Qu.:73.00   1st Qu.:1.000   ford pinto        :  5  
 Median :15.50   Median :76.00   Median :1.000   toyota corolla    :  5  
 Mean   :15.54   Mean   :75.98   Mean   :1.577   amc gremlin       :  4  
 3rd Qu.:17.02   3rd Qu.:79.00   3rd Qu.:2.000   amc hornet        :  4  
 Max.   :24.80   Max.   :82.00   Max.   :3.000   chevrolet chevette:  4  
                                                 (Other)           :365  
   mpg01        
 Mode :logical  
 FALSE:207      
 TRUE :185      

> AutoScale = scale (Auto[,-c(8,9,10)])
> s = sample (nrow (AutoScale), 10)
> s
 [1] 354   1 233  85 163 171 216 297 137  92
> AutoScale [s, ]
            mpg  cylinders displacement horsepower      weight acceleration       year
359  1.04472438 -0.8629108   -0.7110965 -0.7915944 -0.40332370    0.9999309  1.3628576
1   -0.69774672  1.4820530    1.0759146  0.6632851  0.61974833   -1.2836176 -1.6232409
235  0.13505197 -0.8629108   -0.4148541 -0.4278746 -0.27970740    0.1662545  0.2770036
86  -1.33836110  1.4820530    1.4868316  1.8323847  1.32141798   -0.9211496 -0.8088504
165 -0.31337809  0.3095711    0.3496427  0.1436853  0.07230472   -0.1962136 -0.2659234
173  0.19911341 -0.8629108   -0.9977828 -0.8695344 -0.88837051    0.3474885 -0.2659234
218  0.83972778 -0.8629108   -0.7971024 -0.6357145 -0.96842678   -0.2687072  0.2770036
299 -0.05713234  1.4820530    1.4868316  0.5333851  1.08595837    0.6737097  0.8199306
139 -1.21023822  1.4820530    1.1810329  1.1828849  1.74171339   -0.7399156 -0.5373869
93  -1.33836110  1.4820530    1.4963878  1.3907248  1.63104738   -0.9211496 -0.8088504

Solution

  • Everything here seems to be working. I think you are just confused by the row names.

    The first (unnamed) column that prints with a data.frame (or with a matrix that has row names, such as the result of scale()) shows the row names, not the row index. The row names for a data.frame can be anything. For the Auto data set they mostly go in numeric order, but some numbers are missing. For example, the 33rd row is labeled 34 because there is no row named 33.

    head(rownames(Auto), 35)
     [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13"
    [14] "14" "15" "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26"
    [27] "27" "28" "29" "30" "31" "32" "34" "35" "36"
                                    # ^ Note no 33
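
The same effect can be reproduced without ISLR. The following is a minimal sketch using a hypothetical five-row data frame `df` (not part of the original question) with one row dropped, so that its row names have a gap like the one in Auto. A numeric index selects by position, while a character index selects by row name.

```r
# Toy stand-in for Auto: drop one row so the row names have a gap.
# `df`, `picked` and `by_name` are illustrative names, not from the question.
df <- data.frame(x = c(10, 20, 30, 40, 50))
df <- df[-3, , drop = FALSE]        # removes the row named "3"

rownames(df)                        # "1" "2" "4" "5"

# Numeric index: the third remaining row, whose row name is "4".
picked <- df[3, , drop = FALSE]
rownames(picked)                    # "4"
picked$x                            # 40

# Character index: the row whose name is "4" (the same row here).
by_name <- df["4", , drop = FALSE]
by_name$x                           # 40
```

Position 3 prints with the label 4, which is the kind of offset seen in the question's output.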
    

    So your sample is working just fine. If you need the labels to match the values in "s", you can sample by row name rather than by row index:

    s <- sample(rownames(AutoScale), 10)
    s
    AutoScale[s,]
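
If you would rather keep sampling by position, two other options are worth a sketch: translating the printed row names back to positions with `match()`, or resetting the row names so they run from 1 to the number of rows. The example below again assumes a hypothetical toy data frame `df` in place of Auto; `sub` and `pos` are illustrative names.

```r
# Toy stand-in for Auto with a gap in its row names.
df <- data.frame(x = c(10, 20, 30, 40, 50))
df <- df[-3, , drop = FALSE]        # row names are now "1" "2" "4" "5"

s <- sample(nrow(df), 2)            # sample by position
sub <- df[s, , drop = FALSE]

# Map the row names of the sample back to positions in df.
pos <- match(rownames(sub), rownames(df))
identical(pos, s)                   # TRUE

# Alternatively, reset the row names so labels and positions agree.
rownames(df) <- NULL
rownames(df)                        # "1" "2" "3" "4"
```

Resetting discards the original labels, so it is only appropriate when those labels are not needed later.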