I am trying to subsample a data.frame in a way that the sample would have observations that capture as much variation as possible among a set of columns of the original data.frame.
An example with the mtcars
dataset: I'd like to find 3 cars that are the most different from each other by mpg
, vs
and carb
. Looking at the data visually, it would probably be Toyota Corolla (high mpg
, vs
1, low carb
), Cadillac Fleetwood (low mpg
, vs
0, medium carb
) and either Maserati Bora (low-med mpg
, vs
0, high carb
) or Ferrari Dino (medium mpg
, vs
0, med-high carb
):
> mtcars[order(mtcars$mpg),]
mpg cyl disp hp drat wt qsec vs am gear carb
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
How would I do that programmatically, so that I get a smaller data.frame, e.g.
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
I was thinking that this can probably be solved with Latin Hypercube Sampling (lhs
package), but it seems to produce potential data points from a given distribution, rather than giving row indices of the actual dataset.
In addition, what if we also wanted to consider an unordered multi-level factor, e.g. a column brand
:
> mtcars$brand = factor(sapply(strsplit(rownames(mtcars), " "), function(x)x[[1]]))
> levels(mtcars$brand)
[1] "AMC" "Cadillac" "Camaro" "Chrysler" "Datsun" "Dodge"
[7] "Duster" "Ferrari" "Fiat" "Ford" "Honda" "Hornet"
[13] "Lincoln" "Lotus" "Maserati" "Mazda" "Merc" "Pontiac"
[19] "Porsche" "Toyota" "Valiant" "Volvo"
Would it need to be decomposed into dummy variables first?
As per the idea by Solarion, Principal Component Analysis can be used for this. The code for getting ncars
cars using the first principal component and the three columns is:
ncars = 3
df = mtcars[, c("mpg", "vs", "carb")]
PCA1 = prcomp(df, scale.=TRUE)$x[,1]
ResultRows = sort(PCA1)[ceiling(seq(1, length(PCA1), length.out=ncars))]
mtcars[rownames(mtcars) %in% names(ResultRows),]
#> mpg cyl disp hp drat wt qsec vs am gear carb brand
#> Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 Hornet
#> Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 Toyota
#> Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 Maserati
Created on 2021-03-17 by the reprex package (v1.0.0)
For the more complicated case of including brand
, we can use Factor Analysis for Mixed Data:
library(FactoMineR)
mtcars$brand = factor(sapply(strsplit(rownames(mtcars), " "), function(x)x[[1]]))
ncars = 3
df = mtcars[, c("mpg", "vs", "carb", "brand")]
PCA1 = FAMD(df, graph=FALSE)$ind$coord[,1]
ResultRows = sort(PCA1)[ceiling(seq(1, length(PCA1), length.out=ncars))]
mtcars[rownames(mtcars) %in% names(ResultRows),]
#> mpg cyl disp hp drat wt qsec vs am gear carb brand
#> Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 Hornet
#> Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 Toyota
#> Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 Maserati
Created on 2021-03-17 by the reprex package (v1.0.0)