I am trying to apply a rolling window regression model to multiple groups in my data. Part of my data is as below:
gvkey year LC YTO
1 001004 1972 0.1919713 2.021182
2 001004 1973 0.2275895 2.029056
3 001004 1974 0.3341368 2.053517
4 001004 1975 0.3313518 2.090532
5 001004 1976 0.4005829 2.136939
6 001004 1977 0.4471945 2.123909
7 001004 1978 0.4442004 2.150281
8 001004 1979 0.5054544 2.173162
9 001004 1980 0.5269449 2.188077
10 001004 1981 0.5423774 2.200805
11 001004 1982 0.3528982 2.200851
12 001004 1983 0.3674031 2.190487
13 001004 1984 0.2267620 2.181291
14 001004 1985 0.2796132 2.159443
15 001004 1986 0.3382120 2.128420
16 001004 1987 0.3214131 2.089670
17 001004 1988 0.3883732 2.048279
18 001004 1989 0.4466488 1.999539
19 001004 1990 0.4929991 1.955500
20 001004 1991 0.5150894 1.934893
21 001004 1992 0.5218845 1.925521
22 001004 1993 0.5038105 1.904241
23 001004 1994 0.5041639 1.881731
24 001004 1995 0.5196658 1.863143
25 001004 1996 0.5352994 1.844464
26 001004 1997 0.4556059 1.835676
27 001004 1998 0.4905767 1.837886
28 001004 1999 0.5471959 1.824636
29 001004 2000 0.5920976 1.814944
30 001004 2001 0.5998172 1.893943
31 001004 2002 0.4499911 1.889703
32 001004 2003 0.4207154 1.870703
33 001004 2004 0.4371594 1.831638
34 001004 2005 0.4525900 1.802684
35 001004 2006 0.4342149 1.781757
36 001004 2007 0.4899473 1.753360
37 001004 2008 0.5436673 1.680464
38 001004 2009 0.5873861 1.612499
39 001004 2010 0.5216734 1.544322
40 001004 2011 0.5592963 1.415892
41 001004 2012 0.5627509 1.407393
42 001004 2013 0.5904637 1.384202
43 001004 2014 0.6170085 1.353340
44 001004 2015 0.7145900 1.314014
45 001007 1975 0.3721916 2.090532
46 001007 1976 0.2760902 2.136939
47 001007 1977 0.1866554 2.123909
48 001007 1978 0.1977654 2.150281
49 001007 1979 0.1927100 2.173162
50 001007 1980 0.2112344 2.188077
51 001007 1981 -0.2141724 2.200805
52 001007 1982 -0.2072785 2.200851
53 001007 1983 -1.7406963 2.190487
54 001007 1984 -14.8071429 2.181291
55 001009 1982 -1.2753247 2.200851
56 001009 1983 1.3349904 2.190487
57 001009 1984 2.6192237 2.181291
58 001009 1985 0.5867925 2.159443
59 001009 1986 0.6959436 2.128420
60 001009 1987 0.7142857 2.089670
61 001009 1988 0.7771897 2.048279
62 001009 1989 0.8293820 1.999539
63 001009 1990 0.8655382 1.955500
64 001009 1991 0.8712144 1.934893
65 001009 1992 0.8882548 1.925521
66 001009 1993 0.9190540 1.904241
67 001009 1994 0.9411806 1.881731
68 001010 1971 0.6492499 2.002337
69 001010 1972 0.6667664 2.021182
70 001010 1973 0.6840115 2.029056
71 001010 1974 0.7011797 2.053517
72 001010 1975 0.7189469 2.090532
73 001010 1976 0.7367344 2.136939
74 001010 1977 0.7511779 2.123909
75 001010 1978 0.7673365 2.150281
76 001010 1979 0.7795880 2.173162
77 001010 1980 0.7824448 2.188077
78 001010 1981 0.7821913 2.200805
79 001010 1982 0.7646078 2.200851
80 001010 1983 0.7426172 2.190487
81 001010 1984 -0.0657935 2.181291
82 001010 1985 0.2802410 2.159443
83 001010 1986 0.2052373 2.128420
84 001010 1987 0.2465290 2.089670
85 001010 1988 0.3437856 2.048279
86 001010 1989 0.7398662 1.999539
87 001010 1990 0.6360582 1.955500
88 001010 1991 0.7790707 1.934893
89 001010 1992 0.7588472 1.925521
90 001010 1993 0.7695341 1.904241
91 001010 1994 0.8060759 1.881731
92 001010 1995 0.8381234 1.863143
93 001010 1996 0.8661541 1.844464
94 001010 1997 0.8700456 1.835676
95 001010 1998 0.8748443 1.837886
96 001010 1999 0.8884077 1.824636
97 001010 2000 0.8979903 1.814944
98 001010 2003 0.6812689 1.870703
99 001011 1983 0.3043007 2.190487
100 001011 1984 0.3080601 2.181291
My function is
Match.LC.YTO<-function(x){rollapplyr(x,width=10,by.column=F,fill=NA, FUN=function(m){
temp.1<-lm(LC~YTO,data=m)
summary(temp.1)$r.squared*(sign(summary(temp.1)$coefficients[2,1]))
})}
df<-df%>%group_by(gvkey)%>%mutate(MTCH=Match.LC.YTO(df))
My data is grouped by gvkey, and for each group I need to calculate a variable named "MTCH" which equals the R squared value times the sign of coefficient of YTO in the linear model LC~YTO
, and the model is estimated at a rolling window of 10 observations. I got the error message:
Error in mutate_impl(.data, dots) :
'data' must be a data.frame, not a matrix or an array
I have checked many other posts concerning the function rollapply and rollapplyr, and some suggest that I need to convert my df to zoo or matrix before I use rollapply function, but it still did not work.
rollapply
in zoo will accept plain matrix and data frame arguments. That is not the problem. The following are problems with this code:
the code passes a matrix to lm
but lm
takes a data.frame
the code attempts to use rollapply
with width of 10 on an object with fewer than 10 rows in the last group
if the intercept fits perfectly then there will be no 2nd coefficient from lm
so the reference to coefficients[2, 1]
will fail with an error.
Although not erroneous the following are areas for improvement:
TRUE
and FALSE
should be written out in full since T
and F
are valid variable names making this highly error-prone.
when using group_by
in dplyr always match it with an ungroup
. If you don't do that then the output will remember the grouping and the next time you use the output you will get a surprise. For example, consider the differnce between the following two snippets. The first results in n
being the number of elements in the group that that row belongs to whereas the second results in the n
being the number of rows in out
.
out <- df %>% group_by(gvkey) %>% mutate(MTCH = Match.LC.YTO(LC, YTO))
out %>% mutate(n = n())
out <- df %>% group_by(gvkey) %>% mutate(MTCH = Match.LC.YTO(LC, YTO)) %>% ungroup
out %>% mutate(n = n())
questions to SO should be self-contained and reproducible so the library
statements should not be omitted and the data should be provided in a reproducible manner
To fix these problems we
use partial = TRUE
in rollapply
to allow it to pass objects with fewer than 10 rows.
pass the variables involved directly
rollapply
over the row numbers.
add an NA
to the end of the coefficients to be picked up if the coefficient vector otherwise has only 1 element.
for clarity we have separated out the lm_summary
function which was anonymous in the question
for reproduciblity we have added library
statements and the Note at the end
The revised code is:
library(dplyr)
library(zoo)
Match.LC.YTO <- function(LC, YTO) {
lm_summary <- function(ix) {
temp.1 <- lm(LC ~ YTO, subset = ix)
summary(temp.1)$r.squared * sign(c(coef(temp.1), NA)[2])
}
rollapplyr(seq_along(LC), width = 10, FUN = lm_summary, partial = TRUE)
}
df %>% group_by(gvkey) %>% mutate(MTCH = Match.LC.YTO(LC, YTO)) %>% ungroup
If you would rather use fill = NA
insted of partial = TRUE
then add a check for the series length being less than the series width, i.e. less than 10:
Match.LC.YTO2 <- function(LC, YTO) {
lm_summary <- function(ix) {
temp.1 <- lm(LC ~ YTO, subset = ix)
summary(temp.1)$r.squared * sign(c(coef(temp.1), NA)[2])
}
if (length(LC) < 10) return(NA) ##
rollapplyr(seq_along(LC), width = 10, FUN = lm_summary, fill = NA)
}
df %>% group_by(gvkey) %>% mutate(MTCH = Match.LC.YTO2(LC, YTO)) %>% ungroup
For sake of reproducibility we used this as the input df
:
Lines <- " gvkey year LC YTO
1 001004 1972 0.1919713 2.021182
2 001004 1973 0.2275895 2.029056
3 001004 1974 0.3341368 2.053517
4 001004 1975 0.3313518 2.090532
5 001004 1976 0.4005829 2.136939
6 001004 1977 0.4471945 2.123909
7 001004 1978 0.4442004 2.150281
8 001004 1979 0.5054544 2.173162
9 001004 1980 0.5269449 2.188077
10 001004 1981 0.5423774 2.200805
11 001004 1982 0.3528982 2.200851
12 001004 1983 0.3674031 2.190487
13 001004 1984 0.2267620 2.181291
14 001004 1985 0.2796132 2.159443
15 001004 1986 0.3382120 2.128420
16 001004 1987 0.3214131 2.089670
17 001004 1988 0.3883732 2.048279
18 001004 1989 0.4466488 1.999539
19 001004 1990 0.4929991 1.955500
20 001004 1991 0.5150894 1.934893
21 001004 1992 0.5218845 1.925521
22 001004 1993 0.5038105 1.904241
23 001004 1994 0.5041639 1.881731
24 001004 1995 0.5196658 1.863143
25 001004 1996 0.5352994 1.844464
26 001004 1997 0.4556059 1.835676
27 001004 1998 0.4905767 1.837886
28 001004 1999 0.5471959 1.824636
29 001004 2000 0.5920976 1.814944
30 001004 2001 0.5998172 1.893943
31 001004 2002 0.4499911 1.889703
32 001004 2003 0.4207154 1.870703
33 001004 2004 0.4371594 1.831638
34 001004 2005 0.4525900 1.802684
35 001004 2006 0.4342149 1.781757
36 001004 2007 0.4899473 1.753360
37 001004 2008 0.5436673 1.680464
38 001004 2009 0.5873861 1.612499
39 001004 2010 0.5216734 1.544322
40 001004 2011 0.5592963 1.415892
41 001004 2012 0.5627509 1.407393
42 001004 2013 0.5904637 1.384202
43 001004 2014 0.6170085 1.353340
44 001004 2015 0.7145900 1.314014
45 001007 1975 0.3721916 2.090532
46 001007 1976 0.2760902 2.136939
47 001007 1977 0.1866554 2.123909
48 001007 1978 0.1977654 2.150281
49 001007 1979 0.1927100 2.173162
50 001007 1980 0.2112344 2.188077
51 001007 1981 -0.2141724 2.200805
52 001007 1982 -0.2072785 2.200851
53 001007 1983 -1.7406963 2.190487
54 001007 1984 -14.8071429 2.181291
55 001009 1982 -1.2753247 2.200851
56 001009 1983 1.3349904 2.190487
57 001009 1984 2.6192237 2.181291
58 001009 1985 0.5867925 2.159443
59 001009 1986 0.6959436 2.128420
60 001009 1987 0.7142857 2.089670
61 001009 1988 0.7771897 2.048279
62 001009 1989 0.8293820 1.999539
63 001009 1990 0.8655382 1.955500
64 001009 1991 0.8712144 1.934893
65 001009 1992 0.8882548 1.925521
66 001009 1993 0.9190540 1.904241
67 001009 1994 0.9411806 1.881731
68 001010 1971 0.6492499 2.002337
69 001010 1972 0.6667664 2.021182
70 001010 1973 0.6840115 2.029056
71 001010 1974 0.7011797 2.053517
72 001010 1975 0.7189469 2.090532
73 001010 1976 0.7367344 2.136939
74 001010 1977 0.7511779 2.123909
75 001010 1978 0.7673365 2.150281
76 001010 1979 0.7795880 2.173162
77 001010 1980 0.7824448 2.188077
78 001010 1981 0.7821913 2.200805
79 001010 1982 0.7646078 2.200851
80 001010 1983 0.7426172 2.190487
81 001010 1984 -0.0657935 2.181291
82 001010 1985 0.2802410 2.159443
83 001010 1986 0.2052373 2.128420
84 001010 1987 0.2465290 2.089670
85 001010 1988 0.3437856 2.048279
86 001010 1989 0.7398662 1.999539
87 001010 1990 0.6360582 1.955500
88 001010 1991 0.7790707 1.934893
89 001010 1992 0.7588472 1.925521
90 001010 1993 0.7695341 1.904241
91 001010 1994 0.8060759 1.881731
92 001010 1995 0.8381234 1.863143
93 001010 1996 0.8661541 1.844464
94 001010 1997 0.8700456 1.835676
95 001010 1998 0.8748443 1.837886
96 001010 1999 0.8884077 1.824636
97 001010 2000 0.8979903 1.814944
98 001010 2003 0.6812689 1.870703
99 001011 1983 0.3043007 2.190487
100 001011 1984 0.3080601 2.181291"
df <- read.table(text = Lines)
The check for length in the line marked with ## at the end is no longer necessary as recent versions of zoo automatically make this check.