Please see my code block below. In essence, I am using the Auto dataset from the ISLR library, scaling the data frame's quantitative predictors, taking a random sample of row indices, and then printing the rows that were just sampled. When I run this code, the row labels in the output are offset from the values in the "s" vector by between 1 and 5. What is causing these offsets? I was under the impression that indexing with the "s" vector should return only the rows whose indices are in "s".
Please let me know what you think, thanks!
> summary(Auto)
mpg cylinders displacement horsepower weight
Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0 Min. :1613
1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0 1st Qu.:2225
Median :22.75 Median :4.000 Median :151.0 Median : 93.5 Median :2804
Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5 Mean :2978
3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0 3rd Qu.:3615
Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0 Max. :5140
acceleration year origin name
Min. : 8.00 Min. :70.00 Min. :1.000 amc matador : 5
1st Qu.:13.78 1st Qu.:73.00 1st Qu.:1.000 ford pinto : 5
Median :15.50 Median :76.00 Median :1.000 toyota corolla : 5
Mean :15.54 Mean :75.98 Mean :1.577 amc gremlin : 4
3rd Qu.:17.02 3rd Qu.:79.00 3rd Qu.:2.000 amc hornet : 4
Max. :24.80 Max. :82.00 Max. :3.000 chevrolet chevette: 4
(Other) :365
mpg01
Mode :logical
FALSE:207
TRUE :185
> AutoScale = scale (Auto[,-c(8,9,10)])
> s = sample (nrow (AutoScale), 10)
> s
[1] 354 1 233 85 163 171 216 297 137 92
> AutoScale [s, ]
mpg cylinders displacement horsepower weight acceleration year
359 1.04472438 -0.8629108 -0.7110965 -0.7915944 -0.40332370 0.9999309 1.3628576
1 -0.69774672 1.4820530 1.0759146 0.6632851 0.61974833 -1.2836176 -1.6232409
235 0.13505197 -0.8629108 -0.4148541 -0.4278746 -0.27970740 0.1662545 0.2770036
86 -1.33836110 1.4820530 1.4868316 1.8323847 1.32141798 -0.9211496 -0.8088504
165 -0.31337809 0.3095711 0.3496427 0.1436853 0.07230472 -0.1962136 -0.2659234
173 0.19911341 -0.8629108 -0.9977828 -0.8695344 -0.88837051 0.3474885 -0.2659234
218 0.83972778 -0.8629108 -0.7971024 -0.6357145 -0.96842678 -0.2687072 0.2770036
299 -0.05713234 1.4820530 1.4868316 0.5333851 1.08595837 0.6737097 0.8199306
139 -1.21023822 1.4820530 1.1810329 1.1828849 1.74171339 -0.7399156 -0.5373869
93 -1.33836110 1.4820530 1.4963878 1.3907248 1.63104738 -0.9211496 -0.8088504
Everything here seems to be working. I think you are just confused by the row names.
The first (unnamed) column that prints with a data.frame shows the row names, not the row positions. (scale() returns a matrix rather than a data.frame, but it keeps the same row names.) The row names of a data.frame can be anything. For the Auto data set they mostly go in numeric order, but some numbers are missing. For example, the 33rd row is labeled 34 because there is no row named 33.
head(rownames(Auto), 35)
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13"
[14] "14" "15" "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26"
[27] "27" "28" "29" "30" "31" "32" "34" "35" "36"
# ^ Note no 33
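To see the same effect without the ISLR package, here is a minimal sketch on a made-up data frame. The object `df` and its values are hypothetical and only stand in for Auto. Dropping a row leaves a gap in the row names, so selecting by position and selecting by name can then pick different rows.

```r
# Hypothetical toy data frame (not the Auto data)
df <- data.frame(x = c(10, 20, 30, 40, 50))

# Drop the third row; the remaining rows keep their original labels
df <- df[-3, , drop = FALSE]

rownames(df)               # "1" "2" "4" "5"

# The third row by position prints with the label "4"
df[3, , drop = FALSE]

# The same row, selected by its row name instead
df["4", , drop = FALSE]
```

A numeric index always means position, while a character index means row name, which is why the printed labels can differ from the numbers in `s`.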
So your sample is working just fine. If you need the labels to match for some reason, you can sample by row name rather than by row position:
s <- sample(rownames(AutoScale), 10)
s
AutoScale[s,]
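Two other options, sketched here on hypothetical toy data rather than Auto: keep sampling positions and look up the labels those positions will print with, or reset the row names before scaling so that labels and positions agree. The names `df`, `m`, and `labels_for_s` are made up for illustration.

```r
# Hypothetical toy data standing in for Auto; the row names skip "3"
df <- data.frame(x = c(10, 20, 30, 40, 50), y = c(5, 3, 8, 1, 9))
df <- df[-3, ]

m <- scale(df)                 # matrix result; row names carry over from df

set.seed(1)                    # only so the sample is reproducible
s <- sample(nrow(m), 2)        # positions, as in the question

# Option 1: translate positions into the labels that will be printed
labels_for_s <- rownames(m)[s]
m[labels_for_s, ]              # same rows as m[s, ]

# Option 2: reset the row names so labels equal positions
rownames(df) <- NULL
rownames(df)                   # "1" "2" "3" "4"
```

Option 2 throws away the original labels, so it is only appropriate if nothing else relies on them.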