Search code examples
rdataframesplit

How to split my data frame in equal length lists


I am working with a big data base and I am trying to chunk a balance panel in balanced panels so I can perform a plm loop regression.

I have been trying to divide my data frame bb3 (balance grouping) but I can't manage to get what I want, this is my sample data bb3 in my drive as a .RData

bb3

First attempt

Following this question I tried using split.default :

n<-6
f<-gl(n,nrow(bb3) / n )
B<-split.default(bb3, f)

B is close to what I want, but just for my first list B[["1"]] the other 5 lists are empty and I get the following error:

Warning message:
In split.default(bb3, f) : data length is not a multiple of split variable

TL:DR I want to generate a list divided in 6 chunks, each with the 102 clvs with consecutive date. Even if they are not balanced, if they are consecutive I can balanced them with make.pbalanced. This will be my ideal solution

Second attempt

I generated Index to identify every date of every clvs (Index is already generated in bb3 but for the sake of the exercise I am explaining how I did it)

bb3<- bb3 %>% 
  group_by(clvs) %>% 
  mutate(Index = row_number())

The problem is that I don't know how to split it:

si<-split.data.frame(bb3,bb3$Index)

Again si closely relates to what I need, it gives me a list of 551 balance list (102 clvs and one date). I want to divide bb3 given 19 segments of Index. This list (1:19,20:39,40:59,...) will determine how I divide my data frame

I am sure there should be an easy way with function and lapply but I can't come with this solution.

TL:DR I want to divide in 19 chunks my data frame, each chunk with 102 clvs each one with 29 periods date: a list of 19 chunks each with 2,958 observations

The solution I got by @Allan Cameron cut my data base in the size I need it but it doesn't work as a balance panel because it doesn't cut consecutive my datetime

B <- split(bb3, 1:6)
head(B[["6"]])
clvs  fecha hora    pml pml_ene pml_per pml_cng datetime date               
  <fct> <chr> <chr> <dbl>   <dbl>   <dbl>   <dbl> <fct>    <dttm>             
1 07AC~ 2017~ 6      389.    388.    1.07       0 2017-02~ 2017-02-28 07:00:00
2 07AC~ 2017~ 12     677.    674.    3.25       0 2017-02~ 2017-02-28 13:00:00
3 07AC~ 2017~ 18     667.    664.    3.15       0 2017-02~ 2017-02-28 19:00:00
4 07AC~ 2017~ 24     666.    664.    2.44       0 2017-03~ 2017-03-01 01:00:00
5 07AC~ 2017~ 6      664.    662.    2.05       0 2017-03~ 2017-03-01 07:00:00
6 07AC~ 2017~ 12     434.    431.    2.97       0 2017-03~ 2017-03-01 13:00:00

tail(B[["6"]])

clvs  fecha hora    pml pml_ene pml_per pml_cng datetime date               
  <fct> <chr> <chr> <dbl>   <dbl>   <dbl>   <dbl> <fct>    <dttm>             
1 07XC~ 2017~ 18     430.    443.  -13.6        0 2017-03~ 2017-03-21 19:00:00
2 07XC~ 2017~ 24     418.    426.   -8.41       0 2017-03~ 2017-03-22 01:00:00
3 07XC~ 2017~ 6      442.    444.   -2.15       0 2017-03~ 2017-03-22 07:00:00
4 07XC~ 2017~ 12     451.    464.  -12.4        0 2017-03~ 2017-03-22 13:00:00
5 07XC~ 2017~ 18     450.    461.  -11.2        0 2017-03~ 2017-03-22 19:00:00
6 07XC~ 2017~ 24     452.    467.  -15          0 2017-03~ 2017-03-23 01:00:00

So I still can't balance my panel, nor use it with plm

How can I chunk it in 6 with consecutive dates?


Solution

  • Although it seems like an easy task, it was very challenging splitting a balanced panel data into small balance panels.

    @Allan Cameron's answer got it right in the length of the list but not the content. My panels were unbalanced, each clvs had 188 or 187 in the same chunk, and datetime was not consecutive. B[["1"]] had a sequence of 7:00 ,13:00 and 19:00 for one clvs for example. With unbalanced panels my loop with an splm function didn't work.

    The solution was using gl.unequal :

    library(DTK)
    f<-gl.unequal(n=6,k=c(92,92,92,92,92,91))
    B<-split(bb3,f)
    

    This way I get balanced panels, for example B[["1"]]

    head(B3[["1"]])
    1 07AC~ 2017~ 1      686.    684.    2.19       0 2017-02~ 2017-02-28 02:00:00
    2 07AC~ 2017~ 2      665.    664.    1.79       0 2017-02~ 2017-02-28 03:00:00
    3 07AC~ 2017~ 3      393.    392.    1.11       0 2017-02~ 2017-02-28 04:00:00
    4 07AC~ 2017~ 4      383.    381.    1.4        0 2017-02~ 2017-02-28 05:00:00
    5 07AC~ 2017~ 5      383.    381.    1.41       0 2017-02~ 2017-02-28 06:00:00
    6 07AC~ 2017~ 6      389.    388.    1.07       0 2017-02~ 2017-02-28 07:00:00
    
    is.pbalanced(B[["1"]])
    TRUE