
Error in correlation matrix: why does removing NAs still not work?


I'm trying to compute a correlation matrix.

Here is a sample of the dataset.

> head(matrix)
# A tibble: 6 x 16
# Groups:   nquest, nord [6]
  nquest  nord   sex anasc  ireg   eta staciv studio asnonoc2 nace2 nesplav etalav dislav acontrib occnow tpens
   <int> <int> <dbl> <int> <int> <int>  <int> <fct>     <int> <int> <fct>   <fct>  <fct>     <int>  <int> <int>
1    173     1     1  1948    18    72      3 2             2    19 1       2      0            35      2  1800
2   2886     1     1  1949    13    71      1 2             2    16 1       2      0            35      2  1211
3   2886     2     0  1952    13    68      1 3             2    17 1       2      0            42      2  2100
4   5416     1     0  1958     8    62      3 1             1    19 2       1      0            30      2   700
5   7886     1     1  1950     9    70      1 2             2    11 1       2      0            35      2  2000
6  20297     1     1  1960     5    60      1 1             1    19 2       1      0            39      2  1200

Actually, nquest and nord are identification codes: the first identifies the family, the second the member of that family. Even if I try to remove them (because I think they are useless in a correlation matrix), dplyr adds them back automatically.

matrix <- final %>%
           select("sex", "anasc", "ireg", "eta","staciv", "studio", "asnonoc2", 
                  "nace2", "nesplav", "etalav", "dislav", "acontrib", "occnow",
                  "tpens")

dplyr replies

Adding missing grouping variables: `nquest`, `nord`

However, I don't think it is a problem if they remain in the dataset.

My goal is to compute the correlation matrix, but this dataset seems to contain some NA values.

> sum(is.na(matrix))
[1] 109

I've tried the following calls, but none of them works.

The first

cor(matrix, use = "pairwise.complete.obs")

R replies

Error in cor(matrix, use = "pairwise.complete.obs") : 
  'x' must be numeric

The second

cor(na.omit(matrix))

R answers

Error in cor(na.omit(matrix)) : 'x' must be numeric

I've also tried

matrix <- as.numeric(matrix)

But I get a different error

Error: 'list' object cannot be coerced to type 'double'

How can I solve this? What am I doing wrong?


Solution

  • The problem is most likely the type of some of your columns. In your example several columns are factors (shown as <fct> in the tibble header, for example studio, nesplav, etalav and dislav). Their values look numeric, but they are stored as factors, and cor() only accepts numeric input, so it stops with 'x' must be numeric. Removing the NA values does not help because the error is about column types, not missing data, and as.numeric() fails because a data frame is a list of columns rather than a single vector. You therefore need to convert those columns to numeric before running the correlation analysis. One option is type.convert(). Columns that hold genuinely non-numeric text (character strings) must be dropped instead. Also, as the commenters suggested, it is better not to name your objects after existing R functions, such as matrix in your example. Here is my advice:

    # copy your data into dft (avoids reusing the name of the base function matrix)
    dft <- matrix
    
    # convert each column to its natural type (factors holding numbers become numeric)
    dft <- type.convert(dft, as.is = TRUE)
    
    # compute the correlations, excluding the first two columns (the ID codes)
    cor(dft[, -c(1:2)], use = "pairwise.complete.obs")
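
To see the effect without the original data, here is a minimal base R sketch. The data frame `toy` and its values are made up for illustration (only the column names are borrowed from the question), so treat it as a check of the idea rather than a reproduction of the asker's dataset.

```r
# Toy data: hypothetical values, not the asker's dataset
toy <- data.frame(
  eta    = c(72L, 71L, 68L, 62L, 70L, 60L),
  studio = factor(c("2", "2", "3", "1", "2", "1")),  # numbers stored as a factor
  tpens  = c(1800L, 1211L, 2100L, 700L, 2000L, NA)
)

# cor() rejects the data frame because 'studio' is a factor,
# whatever the 'use' argument says about missing values
res <- try(cor(toy, use = "pairwise.complete.obs"), silent = TRUE)
inherits(res, "try-error")   # TRUE

# type.convert() re-reads each column from its labels
toy_num <- type.convert(toy, as.is = TRUE)
sapply(toy_num, is.numeric)  # every column is numeric now

# The correlation matrix can now be computed; NAs are handled pairwise
cm <- cor(toy_num, use = "pairwise.complete.obs")
print(cm)

# Pitfall when converting by hand: as.numeric() on a factor returns level codes
f <- factor(c("10", "20", "10"))
as.numeric(f)                # 1 2 1
as.numeric(as.character(f))  # 10 20 10
```

If you convert factor columns manually instead of using type.convert(), go through as.character() first, as in the last two lines, otherwise the correlations are computed on the internal level codes.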
    

    Hope this helps.
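
A side note on the "Adding missing grouping variables" message from the question: it appears because the tibble is still grouped by nquest and nord, and select() always keeps grouping columns. The sketch below assumes dplyr is installed and uses a small made-up tibble, `final_toy`, as a stand-in for the asker's `final` object. It shows that calling ungroup() first lets select() drop the ID columns.

```r
library(dplyr)

# Hypothetical stand-in for the asker's `final` object: a grouped tibble
final_toy <- tibble(
  nquest = c(173L, 2886L, 2886L, 5416L),
  nord   = c(1L, 1L, 2L, 1L),
  eta    = c(72L, 71L, 68L, 62L),
  tpens  = c(1800L, 1211L, 2100L, 700L)
) %>%
  group_by(nquest, nord)

# While the data is grouped, select() keeps the grouping columns
kept <- final_toy %>% select(eta, tpens)
names(kept)      # still contains nquest and nord

# After ungroup(), select() returns only the requested columns
dropped <- final_toy %>% ungroup() %>% select(eta, tpens)
names(dropped)   # "eta" "tpens"

# The ungrouped, all-numeric result can be passed straight to cor()
cor(dropped, use = "pairwise.complete.obs")
```

With the grouping removed, there is no need to drop the first two columns by position afterwards.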