Search code examples
rcorrelationsampling

How to sample a data.frame to minimise correlation between selected columns?


I am trying to subsample a data.frame in a way that the sample would have observations that capture as much variation as possible among a set of columns of the original data.frame.

An example with the mtcars dataset: I'd like to find 3 cars that are the most different from each other by mpg, vs and carb. Looking at the data visually, it would probably be Toyota Corolla (high mpg, vs 1, low carb), Cadillac Fleetwood (low mpg, vs 0, medium carb) and either Maserati Bora (low-med mpg, vs 0, high carb) or Ferrari Dino (medium mpg, vs 0, med-high carb):

> mtcars[order(mtcars$mpg),]
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1

How would I do that programmatically, so that I get a smaller data.frame, e.g.

Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1

I was thinking that this can probably be solved with Latin Hypercube Sampling (lhs package), but it seems to produce potential data points from a given distribution, rather than giving row indices of the actual dataset.

In addition, what if we also wanted to consider an unordered multi-level factor, e.g. a column brand:

> mtcars$brand = factor(sapply(strsplit(rownames(mtcars), " "), function(x)x[[1]]))
> levels(mtcars$brand)
 [1] "AMC"      "Cadillac" "Camaro"   "Chrysler" "Datsun"   "Dodge"   
 [7] "Duster"   "Ferrari"  "Fiat"     "Ford"     "Honda"    "Hornet"  
[13] "Lincoln"  "Lotus"    "Maserati" "Mazda"    "Merc"     "Pontiac" 
[19] "Porsche"  "Toyota"   "Valiant"  "Volvo"

Would it need to be decomposed into dummy variables first?


Solution

  • As per the idea by Solarion, Principal Component Analysis can be used for this. The code for getting ncars cars using the first principal component and the three columns is:

    ncars = 3
    df = mtcars[, c("mpg", "vs", "carb")]
    
    PCA1 = prcomp(df, scale.=TRUE)$x[,1]
    ResultRows = sort(PCA1)[ceiling(seq(1, length(PCA1), length.out=ncars))]
    mtcars[rownames(mtcars) %in% names(ResultRows),]
    #>                    mpg cyl  disp  hp drat    wt  qsec vs am gear carb    brand
    #> Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2   Hornet
    #> Toyota Corolla    33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1   Toyota
    #> Maserati Bora     15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8 Maserati
    

    Created on 2021-03-17 by the reprex package (v1.0.0)

    For the more complicated case of including brand, we can use Factor Analysis for Mixed Data:

    library(FactoMineR)
    
    mtcars$brand = factor(sapply(strsplit(rownames(mtcars), " "), function(x)x[[1]]))
    
    ncars = 3
    df = mtcars[, c("mpg", "vs", "carb", "brand")]
    
    PCA1 = FAMD(df, graph=FALSE)$ind$coord[,1]
    ResultRows = sort(PCA1)[ceiling(seq(1, length(PCA1), length.out=ncars))]
    mtcars[rownames(mtcars) %in% names(ResultRows),]
    #>                    mpg cyl  disp  hp drat    wt  qsec vs am gear carb    brand
    #> Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2   Hornet
    #> Toyota Corolla    33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1   Toyota
    #> Maserati Bora     15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8 Maserati
    

    Created on 2021-03-17 by the reprex package (v1.0.0)