
PCA - how to check whether all the variables are on the same scale or on different scales


I am working with the uscrime dataset, but this question applies to any well-known dataset such as cars. After some googling, I found that it is extremely useful to standardize my data, because PCA finds new directions based on the covariance matrix of the original variables, and the covariance matrix is sensitive to the scale of the variables.
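To illustrate the scale sensitivity described above, here is a minimal sketch. It assumes the built-in mtcars data as a stand-in, since uscrime does not ship with base R. It compares prcomp() on the raw columns with prcomp(scale. = TRUE), which effectively works from the correlation matrix instead of the covariance matrix.

```r
# PCA on raw columns: directions come from the covariance matrix,
# so columns with large variance dominate the first component.
pca_raw <- prcomp(mtcars, center = TRUE, scale. = FALSE)

# PCA on standardized columns: equivalent to using the correlation matrix.
pca_std <- prcomp(mtcars, center = TRUE, scale. = TRUE)

# Share of total variance captured by the first component.
share_pc1 <- function(p) p$sdev[1]^2 / sum(p$sdev^2)
c(raw = share_pc1(pca_raw), standardized = share_pc1(pca_std))

# Variable with the largest absolute loading on PC1 in each case.
names(which.max(abs(pca_raw$rotation[, "PC1"])))
names(which.max(abs(pca_std$rotation[, "PC1"])))
```

If the two runs give clearly different first components, the variables are probably not on a common scale and standardizing changes the result.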

Nevertheless, I found "It is not necessary to standardize the variables, if all the variables are in same scale."

To standardize the variables I am using the expression: z_uscrime <- (uscrime - mean(uscrime)) / sd(uscrime)
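Note that mean() and sd() applied to a whole data frame do not work column by column, so the expression above is unlikely to produce per-variable z-scores. A minimal sketch of column-wise standardization with base R's scale() follows, again assuming mtcars as a stand-in for uscrime:

```r
# scale() centers each numeric column on its mean and divides by its sd.
z_mtcars <- as.data.frame(scale(mtcars))

# The same computation spelled out by hand, one column at a time.
z_manual <- as.data.frame(lapply(mtcars, function(x) (x - mean(x)) / sd(x)))
rownames(z_manual) <- rownames(mtcars)

# Both approaches should agree up to floating-point error.
all.equal(z_mtcars, z_manual)

# After standardizing, each column has mean close to 0 and sd equal to 1.
round(colMeans(z_mtcars), 10)
sapply(z_mtcars, sd)
```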

Before standardizing my data, how can I check whether all the variables are on the same scale?


Solution

  • Proving my point that you can standardize your data as many times as you want: standardizing data that is already standardized leaves it unchanged

    library(tidyverse)
    library(recipes)
    #> 
    #> Attaching package: 'recipes'
    #> The following object is masked from 'package:stringr':
    #> 
    #>     fixed
    #> The following object is masked from 'package:stats':
    #> 
    #>     step
    
    
    simple_recipe <- recipe(mpg ~ .,data = mtcars) %>% 
      step_center(everything()) %>% 
      step_scale(everything())
    
    
    mtcars2 <- simple_recipe %>% 
      prep() %>%
      juice()
    
    simple_recipe2 <- recipe(mpg ~ .,data = mtcars2) %>% 
      step_center(everything()) %>% 
      step_scale(everything())
    
    mtcars3 <- simple_recipe2 %>% 
      prep() %>%
      juice()
    
    all.equal(mtcars2,mtcars3)
    #> [1] TRUE
    
    mtcars2 %>%
      summarise(across(everything(),.fns = list(mean = ~ mean(.x),sd = ~sd(.x)))) %>% 
      pivot_longer(everything(),names_pattern = "(.*)_(.*)",names_to = c("stat", ".value"))
    #> # A tibble: 11 x 3
    #>    stat       mean    sd
    #>    <chr>     <dbl> <dbl>
    #>  1 cyl   -1.47e-17  1   
    #>  2 disp  -9.08e-17  1   
    #>  3 hp     1.04e-17  1   
    #>  4 drat  -2.92e-16  1   
    #>  5 wt     4.68e-17  1.00
    #>  6 qsec   5.30e-16  1   
    #>  7 vs     6.94e-18  1.00
    #>  8 am     4.51e-17  1   
    #>  9 gear  -3.47e-18  1.00
    #> 10 carb   3.17e-17  1.00
    #> 11 mpg    7.11e-17  1
    
    mtcars3 %>%
      summarise(across(everything(),.fns = list(mean = ~ mean(.x),sd = ~sd(.x)))) %>% 
      pivot_longer(everything(),names_pattern = "(.*)_(.*)",names_to = c("stat", ".value"))
    #> # A tibble: 11 x 3
    #>    stat       mean    sd
    #>    <chr>     <dbl> <dbl>
    #>  1 cyl   -1.17e-17     1
    #>  2 disp  -1.95e-17     1
    #>  3 hp     9.54e-18     1
    #>  4 drat   1.17e-17     1
    #>  5 wt     3.26e-17     1
    #>  6 qsec   1.37e-17     1
    #>  7 vs     4.16e-17     1
    #>  8 am     4.51e-17     1
    #>  9 gear   0.           1
    #> 10 carb   2.60e-18     1
    #> 11 mpg    4.77e-18     1
    

    Created on 2020-06-07 by the reprex package (v0.3.0)
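As for the original question of how to check whether the variables share a scale before standardizing, one possible approach is to compare the per-column spread numerically and with side-by-side boxplots. This is a sketch and not part of the reprex above. mtcars again stands in for uscrime, and the threshold of 10 is an arbitrary illustration, not a rule.

```r
# Per-column summary: widely differing sd or range suggests different scales.
scale_check <- data.frame(
  mean = sapply(mtcars, mean),
  sd   = sapply(mtcars, sd),
  min  = sapply(mtcars, min),
  max  = sapply(mtcars, max)
)
scale_check

# Ratio of the largest to the smallest standard deviation.
sd_ratio <- max(scale_check$sd) / min(scale_check$sd)
sd_ratio > 10  # arbitrary threshold, chosen only for illustration

# Visual check: boxplots drawn on a shared axis, before and after scaling.
boxplot(mtcars, las = 2, main = "Raw variables")
boxplot(scale(mtcars), las = 2, main = "Standardized variables")
```

If a few boxes in the first plot dwarf the others, the variables are on different scales and standardizing before PCA is the safer choice.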