glm function causes a strange change in data frame


I'm working with IBM stock data loaded through quantmod. I created two variables and then used the glm function to model the relationship between them. The code ran without errors, but then I noticed that part of the data frame contains NAs. How can I overcome this issue? Here is my code:

library("quantmod")
getSymbols("IBM")
dim(IBM)
IBM$CurrtDay_up <- ifelse(IBM$IBM.Open < IBM$IBM.Close,1,0)
IBM$LastDay_green <- ifelse((lag(IBM$IBM.Open,k=1) < lag(IBM$IBM.Close,k=1)),1,0)
head(IBM)
           IBM.Open IBM.High IBM.Low IBM.Close IBM.Volume IBM.Adjusted CurrtDay_up LastDay_green
2007-01-03    97.18    98.40   96.26     97.27    9196800     82.78498           1            NA
2007-01-04    97.25    98.79   96.88     98.31   10524500     83.67011           1             1
2007-01-05    97.60    97.95   96.91     97.42    7221300     82.91264           0             1
2007-01-08    98.50    99.50   98.35     98.90   10340000     84.17225           1             0
2007-01-09    99.08   100.33   99.07    100.07   11108200     85.16802           1             1
2007-01-10    98.50    99.05   97.93     98.89    8744800     84.16374           1             1

Then I added the glm function:

IBM_1 <- IBM[3:1000,] # to avoid the first row's NA.
glm_greenDay <- glm(CurrtDay_up~LastDay_green,data=IBM_1,family=binomial(link='logit'))
IBM_1$glm_pred<-predict(glm_greenDay,newdata=IBM_1,type='response')
head(IBM_1)
           IBM.Open IBM.High IBM.Low IBM.Close IBM.Volume IBM.Adjusted CurrtDay_up LastDay_green  glm_pred
2007-01-04       NA       NA      NA        NA         NA           NA          NA            NA 0.5683453
2007-01-05    97.60    97.95   96.91     97.42    7221300     82.91264           0             1        NA
2007-01-07       NA       NA      NA        NA         NA           NA          NA            NA 0.5407240
2007-01-08    98.50    99.50   98.35     98.90   10340000     84.17225           1             0        NA
2007-01-08       NA       NA      NA        NA         NA           NA          NA            NA 0.5683453
2007-01-09    99.08   100.33   99.07    100.07   11108200     85.16802           1             1        NA

UPDATED CODE (please notice that one row (row #2) has been duplicated):

 IBM_1<-IBM[complete.cases(IBM[1:1000,]),] # to avoid the first row's NA.
 glm_greenDay<-glm(CurrtDay_up~LastDay_green,data=IBM_1,family=binomial(link='logit'))
 IBM_1$glm_pred<-glm_greenDay$fitted.values
 head(IBM_1)
           IBM.Open IBM.High IBM.Low IBM.Close IBM.Volume IBM.Adjusted CurrtDay_up LastDay_green  glm_pred
2007-01-03       NA       NA      NA        NA         NA           NA          NA            NA 0.5691203
2007-01-04    97.25    98.79   96.88     98.31   10524500     83.67011           1             1        NA
2007-01-04       NA       NA      NA        NA         NA           NA          NA            NA 0.5691203
2007-01-05    97.60    97.95   96.91     97.42    7221300     82.91264           0             1        NA
2007-01-07       NA       NA      NA        NA         NA           NA          NA            NA 0.5407240
2007-01-08    98.50    99.50   98.35     98.90   10340000     84.17225           1             0        NA

Solution

  • The problem arises because the output of predict() is not an xts object. Its elements are named by date, but it is still a plain numeric vector with no time index. Converting the output of predict() to class xts first let me use a simple call to merge(), without dropping the NAs before modeling (na.action=na.exclude makes predict() return NA for the rows that glm skips):

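    library(quantmod)
    getSymbols("IBM")
    # 1 if the day closed above its open, 0 otherwise
    IBM$CurrtDay_up <- ifelse(IBM$IBM.Open < IBM$IBM.Close, 1, 0)
    # same indicator for the previous day (NA on the first row)
    IBM$LastDay_green <- ifelse((lag(IBM$IBM.Open, k=1) < lag(IBM$IBM.Close, k=1)), 1, 0)
    # na.exclude keeps a placeholder for rows with NAs, so predictions line up with IBM
    glm_greenDay <- glm(CurrtDay_up~LastDay_green, data=IBM, family=binomial(link='logit'), na.action=na.exclude)
    glm_pred <- predict(glm_greenDay, type='response')
    # rebuild the time index from the date names, then merge by date
    glm_pred_xts <- xts(x = glm_pred, order.by = as.Date(names(glm_pred)))
    IBM2 <- merge(IBM, glm_pred_xts)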
    

    That seems to produce the desired output:

    > head(glm_pred)
    2007-01-03 2007-01-04 2007-01-05 2007-01-08 2007-01-09 2007-01-10 
            NA  0.5383952  0.5383952  0.5383065  0.5383952  0.5383952 
    
    > head(IBM2)
               IBM.Open IBM.High IBM.Low IBM.Close IBM.Volume IBM.Adjusted CurrtDay_up LastDay_green glm_pred_xts
    2007-01-03    97.18    98.40   96.26     97.27    9196800     82.78498           1            NA           NA
    2007-01-04    97.25    98.79   96.88     98.31   10524500     83.67011           1             1    0.5383952
    2007-01-05    97.60    97.95   96.91     97.42    7221300     82.91264           0             1    0.5383952
    2007-01-08    98.50    99.50   98.35     98.90   10340000     84.17225           1             0    0.5383065
    2007-01-09    99.08   100.33   99.07    100.07   11108200     85.16802           1             1    0.5383952
    2007-01-10    98.50    99.05   97.93     98.89    8744800     84.16374           1             1    0.5383952
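
For anyone who wants to try the idea without downloading data, here is a minimal offline sketch of the same approach. It assumes synthetic prices in place of the IBM series, and the names `toy`, `prev_up`, `pred_xts` and `toy2` are hypothetical, not from the question. As a small variation, it reuses `index(toy)` instead of parsing dates from the names of the prediction vector; that only works because `na.action = na.exclude` keeps the predictions the same length as the data.

```r
library(xts)  # also attaches zoo, which provides index() and coredata()

# Synthetic stand-in for the price data (made-up values, not real quotes)
set.seed(42)
n     <- 60
dates <- seq(as.Date("2007-01-03"), by = "day", length.out = n)
open  <- 100 + cumsum(rnorm(n))
close <- open + rnorm(n)
up    <- as.numeric(open < close)

toy <- xts(cbind(Open = open, Close = close,
                 up = up,
                 prev_up = c(NA, head(up, -1))),  # previous day's indicator, NA on row 1
           order.by = dates)

# na.exclude: rows with NA are left out of the fit but padded back in predict()
fit <- glm(up ~ prev_up, data = as.data.frame(toy),
           family = binomial(link = "logit"), na.action = na.exclude)
pred <- predict(fit, type = "response")

# Wrap the plain vector in an xts object so merge() can align it by date
pred_xts <- xts(cbind(glm_pred = unname(pred)), order.by = index(toy))
toy2 <- merge(toy, pred_xts)

head(toy2)
```

The merged object should have one row per original date, with `glm_pred` set to NA only where `prev_up` was NA. If the model were fitted with the default `na.omit` instead, `pred` would be shorter than `toy` and the `index(toy)` shortcut would fail, so building the index from `names(pred)` as in the answer above is the safer choice in that case.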