I am working with a big data base and I am trying to chunk a balance panel in balanced panels so I can perform a plm
loop regression.
I have been trying to divide my data frame bb3
(balance grouping) but I can't manage to get what I want, this is my sample data bb3
in my drive as a .RData
First attempt
Following this question I tried using split.default
:
n<-6
f<-gl(n,nrow(bb3) / n )
B<-split.default(bb3, f)
B
is close to what I want, but just for my first list B[["1"]]
the other 5 lists are empty and I get the following error:
Warning message:
In split.default(bb3, f) : data length is not a multiple of split variable
TL:DR I want to generate a list divided in 6 chunks, each with the 102 clvs
with consecutive date
. Even if they are not balanced, if they are consecutive I can balanced them with make.pbalanced
. This will be my ideal solution
Second attempt
I generated Index
to identify every date
of every clvs
(Index
is already generated in bb3
but for the sake of the exercise I am explaining how I did it)
bb3<- bb3 %>%
group_by(clvs) %>%
mutate(Index = row_number())
The problem is that I don't know how to split it:
si<-split.data.frame(bb3,bb3$Index)
Again si
closely relates to what I need, it gives me a list of 551 balance list (102 clvs
and one date
). I want to divide bb3
given 19 segments of Index
. This list (1:19,20:39,40:59,...) will determine how I divide my data frame
I am sure there should be an easy way with function
and lapply
but I can't come with this solution.
TL:DR I want to divide in 19 chunks my data frame, each chunk with 102 clvs
each one with 29 periods date
: a list of 19 chunks each with 2,958 observations
The solution I got by @Allan Cameron cut my data base in the size I need it but it doesn't work as a balance panel because it doesn't cut consecutive my datetime
B <- split(bb3, 1:6)
head(B[["6"]])
clvs fecha hora pml pml_ene pml_per pml_cng datetime date
<fct> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <fct> <dttm>
1 07AC~ 2017~ 6 389. 388. 1.07 0 2017-02~ 2017-02-28 07:00:00
2 07AC~ 2017~ 12 677. 674. 3.25 0 2017-02~ 2017-02-28 13:00:00
3 07AC~ 2017~ 18 667. 664. 3.15 0 2017-02~ 2017-02-28 19:00:00
4 07AC~ 2017~ 24 666. 664. 2.44 0 2017-03~ 2017-03-01 01:00:00
5 07AC~ 2017~ 6 664. 662. 2.05 0 2017-03~ 2017-03-01 07:00:00
6 07AC~ 2017~ 12 434. 431. 2.97 0 2017-03~ 2017-03-01 13:00:00
tail(B[["6"]])
clvs fecha hora pml pml_ene pml_per pml_cng datetime date
<fct> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <fct> <dttm>
1 07XC~ 2017~ 18 430. 443. -13.6 0 2017-03~ 2017-03-21 19:00:00
2 07XC~ 2017~ 24 418. 426. -8.41 0 2017-03~ 2017-03-22 01:00:00
3 07XC~ 2017~ 6 442. 444. -2.15 0 2017-03~ 2017-03-22 07:00:00
4 07XC~ 2017~ 12 451. 464. -12.4 0 2017-03~ 2017-03-22 13:00:00
5 07XC~ 2017~ 18 450. 461. -11.2 0 2017-03~ 2017-03-22 19:00:00
6 07XC~ 2017~ 24 452. 467. -15 0 2017-03~ 2017-03-23 01:00:00
So I still can't balance my panel, nor use it with plm
How can I chunk it in 6 with consecutive dates?
Although it seems like an easy task, it was very challenging splitting a balanced panel data into small balance panels.
@Allan Cameron's answer got it right in the length of the list but not the content. My panels were unbalanced, each clvs
had 188 or 187 in the same chunk, and datetime
was not consecutive. B[["1"]]
had a sequence of 7:00
,13:00
and 19:00
for one clvs
for example. With unbalanced panels my loop with an splm
function didn't work.
The solution was using gl.unequal
:
library(DTK)
f<-gl.unequal(n=6,k=c(92,92,92,92,92,91))
B<-split(bb3,f)
This way I get balanced panels, for example B[["1"]]
head(B3[["1"]])
1 07AC~ 2017~ 1 686. 684. 2.19 0 2017-02~ 2017-02-28 02:00:00
2 07AC~ 2017~ 2 665. 664. 1.79 0 2017-02~ 2017-02-28 03:00:00
3 07AC~ 2017~ 3 393. 392. 1.11 0 2017-02~ 2017-02-28 04:00:00
4 07AC~ 2017~ 4 383. 381. 1.4 0 2017-02~ 2017-02-28 05:00:00
5 07AC~ 2017~ 5 383. 381. 1.41 0 2017-02~ 2017-02-28 06:00:00
6 07AC~ 2017~ 6 389. 388. 1.07 0 2017-02~ 2017-02-28 07:00:00
is.pbalanced(B[["1"]])
TRUE