I have a large csv dataset with more than 45k rows and 19 different variables. I'd like to filter it by a specific variable (V4) so that each filtered group starts with 0 and then the next 0 will mark the start of a new group/dataframe/datatable, while keeping all other variables inside this new table as well. I need those separate groups to further analyse each case of data. I tried:
filtered_data <- my_data %>%
group_by("V4") %>%
filter("V4" == 0 & "V4" !=0)
View(filtered_data)
The first "V4" == 0 seems to work but I'm struggling how to define the end of each filtered dataframe e.g. how to filter from 0 to 3, then 0 to 5 etc. How can I determine the length of each case? Is there a logical operator that saves each group before V4 turns 0 again? Or would it be better to create a loop?
Example of my_data:
V1 V2 V3 V4 . . . V19
1 0
2 1
3 2
4 ` 3
5 0
6 1
7 2
8 3
9 4
10 5
11 0
...
45k
Here is a way to group your rows with basic arithmetic.
I create the groups using a cumulative sum of an indicator variable (V4 is 0 or not) and split the data.frame into single dataframes using group_split
.
# example data 12000 rows in total, 4000 groups of 3 rows
df <- data.frame(V1 = 1:12000,
V2 = sample(LETTERS, 12000, replace = T),
V4 = rep(0:2, 4000))
df <- df %>%
mutate(Groups = ifelse(V4 == 0, 1, 0),
Groups = cumsum(Groups)) %>%
group_split(Groups)
So the first group/dataframe is
> df[[1]]
# A tibble: 3 x 4
V1 V2 V4 Groups
<int> <chr> <int> <dbl>
1 1 L 0 1
2 2 L 1 1
3 3 Y 2 1
the second
> df[[2]]
# A tibble: 3 x 4
V1 V2 V4 Groups
<int> <chr> <int> <dbl>
1 4 Z 0 2
2 5 N 1 2
3 6 Y 2 2
and so on.
If you want to save each data.frame seperately you could use something like this:
# new environment that holds all data.frames
dfEnv <- new.env()
df %>%
mutate(Groups = ifelse(V4 == 0, 1, 0),
Groups = cumsum(Groups)) %>%
group_by(Groups) %>%
do({
# save every group inside the new environment as a single data.frame
dfEnv[[paste0("Group_", unique(.$Groups))]] <- .
})
Now you have dfEnv$Group_1
, dfEnv$Group_2
, ... and so on.
Inside do()
you could also use saveRDS
or write.csv
to save the data to disk.