Tags: r, linear-regression, spss, pearson-correlation

Difference between R and SPSS linear model results


I'm a beginner at statistics, currently attending an introductory course that uses SPSS. I've been trying to learn R at the same time, and so far I've consistently been getting the same results from both tools, as expected.

However, we're currently doing correlations (Pearson's r) and fitting linear models, and I'm consistently getting different results between R and SPSS.

The dataset is GSS2012.sav in this zip-file.

d = GSS2012$tolerance
e = GSS2012$age
f = GSS2012$polviews
g = GSS2012$educ

term        SPSS     R             std. error (SPSS)
intercept   6,694    7,29707726    0,623
e           -0,031   -0,03130627   0,006
f           -0,123   -0,20586503   0,072
g           0,411    0,40029541    0,033

Full, minimal working examples that reproduce the results above are found below.

I've tried different values of the use argument for cor; it made no difference.

cor(d, e, use = "pairwise.complete.obs")
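
One possible reason the use argument made no difference: with only two vectors, "complete.obs" and "pairwise.complete.obs" drop exactly the same rows. The sketch below illustrates this on made-up vectors (x and y are hypothetical, not taken from GSS2012) and counts the rows actually used, which is the number to compare with the N that SPSS reports.

```r
# Made-up vectors with a few missing values (not GSS2012 data).
x <- c(1, 2, NA, 4, 5, 6, NA, 8)
y <- c(2, 1, 3, NA, 6, 5, 7, 9)

# With only two vectors there is a single pair of variables, so both
# options drop exactly the rows where either value is NA.
r_pairwise <- cor(x, y, use = "pairwise.complete.obs")
r_complete <- cor(x, y, use = "complete.obs")

# Number of rows that actually enter the calculation.
n_used <- sum(complete.cases(x, y))

print(c(pairwise = r_pairwise, complete = r_complete))
print(n_used)
```

If the N printed here differs from the N in the SPSS correlation table, the two programs are not working with the same set of cases.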

Full, minimal working example for lm:

> library(haven)
> GSS2012 <- read_sav("full version/GSS2012.sav")
> lm(GSS2012$tolerance ~ GSS2012$age + GSS2012$polviews + GSS2012$educ, na.action="na.exclude", singular.ok = F)

Call:
lm(formula = GSS2012$tolerance ~ GSS2012$age + GSS2012$polviews + 
    GSS2012$educ, na.action = "na.exclude", singular.ok = F)

Coefficients:
     (Intercept)       GSS2012$age  GSS2012$polviews      GSS2012$educ  
         7.29708          -0.03131          -0.20587           0.40030  

Nothing has so far given me the same values as SPSS. Not that I know the latter are necessarily correct; I'd just like to replicate the results.

SPSS script:

DATASET ACTIVATE DataSet1. 
REGRESSION 
  /MISSING LISTWISE 
  /STATISTICS COEFF OUTS R ANOVA 
  /CRITERIA=PIN(.05) POUT(.10) 
  /NOORIGIN 
  /DEPENDENT tolerance 
  /METHOD=ENTER age polviews educ.
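
For comparison with /MISSING LISTWISE: lm also drops every row that has an NA in any model variable, so the number of cases used should match as long as both programs agree on which values are missing. The sketch below checks this on simulated data; the data frame toy and its values are made up, and only the column names mirror the question.

```r
# Simulated stand-in for the survey data; values are made up.
set.seed(1)
n <- 50
toy <- data.frame(
  age      = sample(18:89, n, replace = TRUE),
  polviews = sample(1:7, n, replace = TRUE),
  educ     = sample(8:20, n, replace = TRUE)
)
toy$tolerance <- 5 - 0.03 * toy$age + 0.4 * toy$educ + rnorm(n)

# Knock out a few values so that missing-data handling matters.
toy$educ[c(3, 17)] <- NA
toy$polviews[c(17, 30)] <- NA

# Passing data = keeps coefficient names short (age, not GSS2012$age).
fit <- lm(tolerance ~ age + polviews + educ, data = toy,
          na.action = na.exclude)

# Rows with no NA in any model variable: the listwise-complete cases.
model_vars <- c("tolerance", "age", "polviews", "educ")
n_listwise <- sum(complete.cases(toy[, model_vars]))

print(nobs(fit))         # observations used by lm
print(n_listwise)        # listwise-complete rows, same number
print(df.residual(fit))  # observations used minus 4 coefficients
```

Running nobs() and df.residual() on the real model and comparing them with the N and residual df in the SPSS ANOVA table is a quick way to see whether the two fits use the same cases.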

Articles like these are probably related: link1; link2; link3, but I haven't been able to use the information in them to replicate the SPSS results. (Again, R might have more accurate results; I don't know. But I'm in "an SPSS environment", so it would be good to be able to get the same results for now.)


Solution

  • This is only a partial answer: I can see what the problem is, but I'm not sure what is causing it. The issue has to do with missing values and how they're handled in the SPSS file. Let's take the educ variable as an example.

    In the SPSS file, the values 97, 98, and 99 are defined as missing values for educ.

    If you sort the SPSS file by the educ column, you can see there are 2 data rows with these missing values: IDs 837 and 1214.

    In R, you can confirm that those rows do in fact contain missing values (NA):

    > which(is.na(GSS2012$educ))
    [1]  837 1214
    

    The problem is that when you actually ask SPSS to count how many rows are missing, it says there's only 1 missing row:

    FREQUENCIES VARIABLES=educ 
      /FORMAT=NOTABLE
      /ORDER= ANALYSIS .
    

    The problem is with ID 1214: SPSS is not treating its educ value of 99 as missing. For example, try changing educ for ID 837 to any other (non-missing) number, and you'll see that SPSS reports 0 missing rows for educ, when in fact 1214 should still be missing (99).

    I haven't checked, but I'm guessing a similar thing is happening to a number of rows for the polviews variable.

    The consequence of this is that SPSS isn't treating those rows as missing data when you run the analysis, but in R those values are correctly set as missing and omitted. In other words, SPSS is using more data for the model than it should. You can confirm this by looking at the SPSS and R output: the degrees of freedom differ between the 2 programs, which then leads to a (slight) difference in results.

    I'm not sure why SPSS isn't treating those rows as missing. It could either be a bug (wouldn't be the first for SPSS...) or something to do with the way the file has been set up. I haven't checked the latter because it's a big file and I'm not familiar enough with the dataset to know where to look.
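
To illustrate the mechanism described above, here is a minimal sketch on simulated data (all values are made up; only the variable names and the codes 97, 98, 99 come from the question and answer). It fits the same model twice: once with both 99 codes turned into NA, and once with one 99 left in as if it were a real value, which is the behaviour the answer suspects in SPSS.

```r
# Simulated data; names mirror the question, values are made up.
set.seed(42)
n <- 200
educ <- sample(8:20, n, replace = TRUE)
tolerance <- 2 + 0.4 * educ + rnorm(n)

# Two respondents carry the "missing" code 99 in educ.
educ_raw <- educ
educ_raw[c(10, 20)] <- 99

# Treatment A: every 97/98/99 becomes NA (what R shows in the answer).
educ_a <- ifelse(educ_raw %in% c(97, 98, 99), NA_real_, educ_raw)

# Treatment B: row 10 is NA, but row 20 keeps 99 as if it were a real
# number of years of education (assumed to mimic the SPSS behaviour).
educ_b <- educ_a
educ_b[20] <- 99

fit_a <- lm(tolerance ~ educ_a)
fit_b <- lm(tolerance ~ educ_b)

print(c(missing_a = sum(is.na(educ_a)), missing_b = sum(is.na(educ_b))))
print(c(df_a = df.residual(fit_a), df_b = df.residual(fit_b)))
print(coef(fit_a))
print(coef(fit_b))
```

The residual degrees of freedom differ by one between the two fits and the coefficients shift, which is the same pattern as the df mismatch between the R and SPSS output described above. Whether this is what actually happens in the GSS2012 file remains to be checked against the real data.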