Search code examples
rregressionlinear-regressionp-valuetrend

Finding p-value and z statistics along with the OLS Linear regression


I could find the coefficients and intercepts from linear regression but unable to find a suitable method to get p-value and z value for respective variable trend. Additionally, not able to find a method to save the output results in excel format. The data is here. There are 24 variables against time. I am not getting the z-statistics and p-values, additionally estimates are also incorrect by first method. where am i wrong?

library("trend")

# read ozone data (I converted to a text file first)
otm <- read.table("D:/data.txt",header=T)


#  make a data frame version
otm_df <- data.frame(otm)
markers <- sample(0:1, replace = T, size = 11)

# calculate OLS slope for all columns
# the -1 at end removes the intercepts
ols <- sapply(otm_df, function(x) coef(lm(markers ~ x))[-1])

I tried this method. I didn't get the z-statistics and could not save it to excel format.

library(reshape2)
DF <- reshape2::melt(otm, id.var = "Year")
library(broom); library(tidyverse)
ols <- DF %>% nest(data = -variable) %>% 
  mutate(model = map(data, ~lm(value ~ Year, data = .)), 
         tidied = map(model, tidy)) %>% 
  unnest(tidied)

#to save the results in excel format (not working here for me)
capture.output(summary(ols), file = "ols.csv" )
write.csv(ols, file.path('E:/',filename = "ols2.csv"), row.names = TRUE) 
# A tibble: 48 x 8
   variable data              model  term         estimate std.error statistic p.value
   <fct>    <list>            <list> <chr>           <dbl>     <dbl>     <dbl>   <dbl>
 1 BanTES   <tibble [11 x 2]> <lm>   (Intercept) -236.       488.       -0.483   0.641
 2 BanTES   <tibble [11 x 2]> <lm>   Year           0.139      0.242     0.572   0.582
 3 SriTES   <tibble [11 x 2]> <lm>   (Intercept)  220.       351.        0.627   0.546
 4 SriTES   <tibble [11 x 2]> <lm>   Year          -0.0935     0.174    -0.536   0.605
 5 AfgTES   <tibble [11 x 2]> <lm>   (Intercept)  364.       444.        0.820   0.434
 6 AfgTES   <tibble [11 x 2]> <lm>   Year          -0.161      0.221    -0.730   0.484
 7 BhuTES   <tibble [11 x 2]> <lm>   (Intercept)  373.       831.        0.449   0.664
 8 BhuTES   <tibble [11 x 2]> <lm>   Year          -0.170      0.413    -0.412   0.690
 9 IndTES   <tibble [11 x 2]> <lm>   (Intercept) -342.       213.       -1.60    0.143
10 IndTES   <tibble [11 x 2]> <lm>   Year           0.190      0.106     1.80    0.106 
summary(ols)
    variable  data.Length  data.Class  data.Mode model.Length  model.Class  model.Mode     term          
 BanTES : 2   2       tbl_df  list               12    lm    list                      Length:48         
 SriTES : 2   2       tbl_df  list               12    lm    list                      Class :character  
 AfgTES : 2   2       tbl_df  list               12    lm    list                      Mode  :character  
 BhuTES : 2   2       tbl_df  list               12    lm    list                                        
 IndTES : 2   2       tbl_df  list               12    lm    list                                        
 NepTES : 2   2       tbl_df  list               12    lm    list                                        
 (Other):36   2       tbl_df  list               12    lm    list  

Any help will be useful. Thank you in advance !


Solution

  • Linear regression does not give you Z statistic as rightly commented by @Roland rather linear regression gives you t statistic. You can use the following code to save the coeff, t statistic, and p-value in excel format

    library(tidyverse)
    library(broom)
    library(openxlsx)
    
    # read ozone data (I converted to a text file first)
    otm <- read.table("data.txt", header=T)
    head(otm, 2)
    
    otm %>% 
      pivot_longer(-c("Year")) %>%
      group_by(name) %>% 
      do(fitlm = tidy(lm(value ~ Year, data = ., na.action=na.omit))) %>% 
      unnest(fitlm) %>% 
      write.xlsx("Linear_trend_1.xlsx")
    

    In the "Linear_trend_1.xlsx" file, Year row for each location provides you the slope and the statistic is t statistic.

    Your values from the first method are not matching with the second method because the first method uses markers as the dependent variable and all the variables in the otm_df were used as the independent variables. In the second method, Year was used as independent variable while all the location values were the dependent variable.
    Another thing, you are using sample function to create markers which will give you different outputs for every run. So you have to use set.seed to make it constant for every run like

    set.seed(123)
    otm %>% 
      mutate(markers = sample(0:1, replace = T, size = 11)) %>% 
      pivot_longer(-c("Year", "markers")) %>%
      group_by(name) %>% 
      do(fitlm = tidy(lm(markers ~ value, data = ., na.action=na.omit))) %>% 
      unnest(fitlm) %>% 
      write.xlsx("Linear_trend_2.xlsx")
    
    # A tibble: 48 x 6
       name   term        estimate std.error statistic p.value
       <chr>  <chr>          <dbl>     <dbl>     <dbl>   <dbl>
     1 AfgERA (Intercept)  -8.58      7.79      -1.10    0.299
     2 AfgERA value         0.577     0.498      1.16    0.276
     3 AfgTES (Intercept)  -1.62      2.99      -0.542   0.601
     4 AfgTES value         0.0521    0.0750     0.695   0.505
     5 AfgTOM (Intercept) -25.3      19.5       -1.30    0.225
     6 AfgTOM value         1.94      1.46       1.33    0.218
     7 BanERA (Intercept)  -2.28      5.03      -0.453   0.662
     8 BanERA value         0.201     0.370      0.543   0.600
     9 BanTES (Intercept)  -3.42      2.80      -1.22    0.253
    10 BanTES value         0.0892    0.0644     1.39    0.199
    # ... with 38 more rows
    

    If I modify your first approach like the following, you can see that the output from the above code and your code is basically same

    #  make a data frame version
    otm_df <- data.frame(otm)
    set.seed(123) #To have same output from sample function for every run
    markers <- sample(0:1, replace = T, size = 11)
    
    # calculate OLS slope for all columns
    # the -1 at end removes the intercepts
    ols <- sapply(otm_df, function(x) coef(lm(markers ~ x))[-1])
    
    Year.x     BanTES.x     SriTES.x     AfgTES.x     BhuTES.x     IndTES.x 
     0.054545455  0.089176876  0.092629314  0.052132553  0.001602003  0.107446434 
        NepTES.x     PakTES.x      SATES.x     BanERA.x     SriERA.x     AfgERA.x 
     0.065125607 -0.115108438  0.315928753  0.200712285 -0.519996739  0.577317012 
        BhuERA.x     IndERA.x     NepERA.x     PakERA.x      SAERA.x     BanTOM.x 
    -0.140990921  0.110578204 -0.030546686  0.265909319  0.176797510  1.897234338 
        SriTOM.x     AfgTOM.x     BhuTOM.x     IndTOm.x     NepTOM.x     PakTOM.x 
     5.380610477  1.935170281  1.761172711  2.248531891  2.107380452  2.011356580 
         SATOM.x 
     2.214848959 
    

    Here is the data in dput format

    dput(otm)
    structure(list(Year = 2008:2018, BanTES = c(44.06247376, 43.81239107, 
    40.68010622, 46.97760506, 37.49591135, 43.81239107, 43.81239107, 
    43.81239107, 43.81239107, 45.27803189, 44.06247376), SriTES = c(32.07268265, 
    35.01918477, 29.91018035, 34.1291577, 28.5258431, 32.07268265, 
    32.07268265, 32.07268265, 32.07268265, 30.96753552, 32.07268265
    ), AfgTES = c(39.19328867, 42.06325898, 42.31015918, 40.54543762, 
    34.28696385, 40.54543762, 40.54543762, 40.54543762, 40.54543762, 
    37.38974643, 39.19328867), BhuTES = c(34.08241824, 36.95440954, 
    30.41561338, 30.37004394, 19.8861367, 30.41561338, 30.41561338, 
    30.41561338, 30.41561338, 32.09933763, 32.09933763), IndTES = c(40.05913352, 
    41.54741392, 38.88957844, 42.47544504, 43.24350644, 41.54741392, 
    41.54741392, 41.54741392, 41.54741392, 42.65820983, 42.47544504
    ), NepTES = c(38.12979871, 37.62785275, 34.40488247, 37.7995467, 
    39.64286364, 37.7995467, 37.7995467, 37.7995467, 37.7995467, 
    30.63632105, 37.7995467), PakTES = c(41.38388734, 41.99865359, 
    42.16236093, 42.51838941, 43.4444952, 42.16236093, 42.16236093, 
    42.16236093, 42.16236093, 44.96627251, 42.51838941), SATES = c(40.03077, 
    41.52302, 39.6327, 41.9098, 41.11191, 41.11191, 41.11191, 41.11191, 
    41.11191, 41.57009, 41.52302), BanERA = c(13.76686693, 13.904453, 
    13.40584856, 13.45199721, 13.47657436, 12.8992102, 13.3586098, 
    14.23223365, 13.4228729, 13.21487616, 14.50830571), SriERA = c(11.81852768, 
    11.79187354, 11.51484349, 11.50552588, 11.489789, 11.23384852, 
    10.61182708, 11.33951759, 11.6357584, 11.74685028, 12.14987906
    ), AfgERA = c(15.44115983, 15.425995, 15.6161623, 15.47751927, 
    15.81748069, 15.47498417, 15.41748855, 16.06462541, 15.61143062, 
    15.32810621, 16.39162424), BhuERA = c(14.34493453, 14.28085419, 
    14.24728543, 14.03202106, 14.04152053, 13.42977221, 13.22665229, 
    14.344052, 13.58792484, 13.28851619, 14.28029524), IndERA = c(14.08262362, 
    14.11485037, 13.80713493, 13.86114379, 13.92607879, 13.37996473, 
    13.45767152, 14.49365275, 13.88142768, 13.73986257, 14.77032404
    ), NepERA = c(14.93883379, 14.896056, 14.78607828, 14.50880606, 
    14.69793299, 13.96309811, 14.18825383, 15.32530354, 14.38700954, 
    13.98545482, 14.9828434), PakERA = c(14.89773191, 14.87337075, 
    14.89837223, 14.76236826, 15.13918051, 14.61385609, 14.54589641, 
    15.40150813, 14.8588883, 14.62185208, 15.6575491), SAERA = c(14.38877, 
    14.40468, 14.22069, 14.20561, 14.35855, 13.8399, 13.88027, 14.83054, 
    14.24554, 14.06201, 15.09615), BanTOM = c(9.317937851, 9.308046341, 
    9.327401161, 9.319338799, 9.311285019, 9.300317764, 9.292790413, 
    9.540946007, 9.04840374, 9.300317764, 9.317937851), SriTOM = c(5.437336445, 
    5.436554909, 5.435492039, 5.440690994, 5.436693192, 5.440601349, 
    5.427892685, 5.54946661, 5.427827358, 5.440601349, 5.437336445
    ), AfgTOM = c(13.31581736, 13.30339324, 13.30090284, 13.29781604, 
    13.33800817, 13.31919873, 13.30073023, 13.62503445, 13.16488469, 
    13.30073023, 13.31581736), BhuTOM = c(11.69911337, 11.67375898, 
    11.71142626, 11.69099903, 11.68556881, 11.68714046, 11.65387106, 
    11.97064924, 11.44872904, 11.67375898, 11.69099903), IndTOm = c(9.709311898, 
    9.704142364, 9.703938368, 9.72520479, 9.709638531, 9.708799697, 
    9.690851817, 9.952517961, 9.499369441, 9.704142364, 9.709638531
    ), NepTOM = c(12.45835066, 12.43677187, 12.48850822, 12.49002218, 
    12.46283817, 12.50376368, 12.44072294, 12.78685617, 12.27891684, 
    12.44072294, 12.46283817), PakTOM = c(12.37911913, 12.38028261, 
    12.37067625, 12.38315158, 12.38352468, 12.36856567, 12.37349086, 
    12.67422019, 12.18962786, 12.37349086, 12.38315158), SATOM = c(10.63543, 
    10.62967, 10.62981, 10.64489, 10.63941, 10.63525, 10.6195, 10.89613, 
    10.44028, 10.62967, 10.63941)), class = "data.frame", row.names = c(NA, 
    -11L))