
Finding non-linear correlations in R


I have about 90 variables stored in data[2-90]. I suspect about 4 of them will have a parabola-like correlation with data[1]. I want to identify which ones have the correlation. Is there an easy and quick way to do this?

I have tried building a model like this (which I could do in a loop for each variable i = 2:90):

y <- data$AvgRating
x <- data$Hamming.distance
x2 <- x^2

quadratic.model = lm(y ~ x + x2)

I then look at the R² and the coefficient on x2 to gauge the strength of the quadratic relationship. Is there a better way of doing this?

Maybe R could build a regression model with the 90 variables and choose the significant ones itself? Would that be possible in any way? I can do this in JMP for linear regression, but I'm not sure I could do non-linear regression with R for all the variables at once. So I was manually trying to see in advance which ones are correlated. It would be helpful if there were a function for that.
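For what it's worth, the loop you describe can be written compactly with `sapply()`. Here is a sketch on made-up toy data (the variable names `x1`, `x2`, `x3` are illustrative, not from your dataset); it assumes the response is in column 1 and fits `y ~ x + x^2` for every other column via `poly()`:

```r
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
y  <- x2^2 + rnorm(n, sd = 0.1)   # parabola-like relationship in x2 only
dat <- data.frame(y, x1, x2, x3)

# Adjusted R^2 of a quadratic fit of column 1 on each candidate column
r2 <- sapply(2:ncol(dat), function(i) {
  fit <- lm(dat[[1]] ~ poly(dat[[i]], 2))
  summary(fit)$adj.r.squared
})
names(r2) <- names(dat)[-1]
sort(r2, decreasing = TRUE)
```

With `2:ncol(dat)` replaced by `2:90`, this ranks all candidate predictors by how well a quadratic explains the response; the truly parabolic variable should stand out at the top.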


Solution

  • Another option would be to compute mutual information score between each pair of variables. For example, using the mutinformation function from the infotheo package, you could do:

    set.seed(1)
    
    library(infotheo)
    
    # correlated vars (x & y correlated, z noise)
    x <- seq(-10,10, by=0.5)
    y <- x^2
    z <- rnorm(length(x))
    
    # list of vectors
    raw_dat <- list(x, y, z)
    
    
    # convert to a dataframe and discretize for mutual information
    dat <- matrix(unlist(raw_dat), ncol=length(raw_dat))
    dat <- discretize(dat)
    
    mutinformation(dat)
    

    Result:

    |   |        V1|        V2|        V3|                                                                                            
    |:--|---------:|---------:|---------:|                                                                                            
    |V1 | 1.0980124| 0.4809822| 0.0553146|                                                                                            
    |V2 | 0.4809822| 1.0943907| 0.0413265|                                                                                            
    |V3 | 0.0553146| 0.0413265| 1.0980124| 
    

    By default, mutinformation() computes the discrete empirical mutual information score between two or more variables. The discretize() step is necessary when you are working with continuous data: it transforms the values into discrete bins first.

    This might be helpful, at least as a first pass, for spotting nonlinear relationships between variables such as the one described above.
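To scale this to the 90-variable case, you don't need the full pairwise matrix printed out: discretize the whole data frame and read off the row of the MI matrix corresponding to `data[1]`, then sort. A sketch using the same toy data as above (this assumes the infotheo package is installed):

```r
library(infotheo)

set.seed(1)
# toy data: y depends on x quadratically, z is pure noise
x <- seq(-10, 10, by = 0.5)
dat <- discretize(data.frame(x = x, y = x^2, z = rnorm(length(x))))

mi <- mutinformation(dat)
# MI of every other variable with the first column, largest first
sort(mi[1, -1], decreasing = TRUE)
```

The quadratic partner (`y` here) should score well above the noise variable, which is exactly the ranking you would scan for the suspected ~4 parabola-like predictors.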