I have a data frame with 3523 observation and 92 variables.
Below an example of a data frame with 6; the 24h recording of the observations starts at 4:00am and ends 4:00am .
04:00 04:15 04:30 05:00 ... 04:35
1 - - - - ... -
2 2 2 2 - ... -
3 2 - - 2 ... -
4 - - 2 - ... -
5 - - - - ... -
6 - - - - ... 2
Each row contain values '-' and '2'.
I want to extract the beginning and the ending of the intervals starting with: '2'
For example 2: 04:15-04:30;
3: 04:00 ; 05:00
4: 04:30
Thank you
Let's expand a bit your example. In the expanded example, we can note that there is no 2
for the row number 1, and that there are also several trickier ones, like for example row 6 where we have 2
, then a break (-
), after that a sequence of two 2
s, a -
, and a 2
again.
04:00 04:15 04:30 05:00 05:15 05:30
1: - - - - - -
2: 2 2 2 - 2 2
3: 2 - - 2 2 2
4: - - 2 - 2 2
5: - - - - 2 2
6: 2 - 2 2 - 2
7: - - - - 2 2
8: 2 2 - 2 2 2
9: - - - - 2 2
10: 2 2 - 2 2 2
You can reproduce it if you type in:
WorkSchedulesDay1 <- structure(list(`04:00` = c("-", "2", "2", "-", "-", "2", "-",
"2", "-", "2"), `04:15` = c("-", "2", "-", "-", "-", "-", "-",
"2", "-", "2"), `04:30` = c("-", "2", "-", "2", "-", "2", "-",
"-", "-", "-"), `05:00` = c("-", "-", "2", "-", "-", "2", "-",
"2", "-", "2"), `05:15` = c("-", "2", "2", "2", "2", "-", "2",
"2", "2", "2"), `05:30` = c("-", "2", "2", "2", "2", "2", "2",
"2", "2", "2")), row.names = c(NA, -10L), class = c("data.table",
"data.frame"))
After that you apply the code:
WorkSchedulesDay1 <- WorkSchedulesDay1 %>%
group_by(rn = row_number()) %>%
gather(time, val, 1:6) %>%
arrange(time) %>%
mutate(tmp = cumsum(coalesce(val != lag(val), FALSE))) %>% arrange(rn) %>%
filter(!val == "-") %>%
group_by(rn, tmp) %>%
mutate(
time = case_when(
n() > 1 ~ paste(min(time), max(time), sep = " - "),
TRUE ~ time
)
) %>%
ungroup() %>% distinct(rn, tmp, time) %>%
group_by(rn) %>%
mutate(
intervals = case_when(
n() > 1 ~ paste(time, collapse = ", "),
TRUE ~ time
)
) %>% distinct(rn, intervals) %>%
write_csv("WorkSchedulesDay1.csv")
You will see that what you get is:
rn intervals
<int> <chr>
2 04:00 - 04:30, 05:15 - 05:30
3 04:00, 05:00 - 05:30
4 04:30, 05:15 - 05:30
5 05:15 - 05:30
6 04:00, 04:30 - 05:00, 05:30
7 05:15 - 05:30
8 04:00 - 04:15, 05:00 - 05:30
9 05:15 - 05:30
10 04:00 - 04:15, 05:00 - 05:30
There is no record for the row number 1, simply because there are only -
in there.
Similarly, there is no record for 05:00
in row number 2, simply because there is a -
in there.
In a similar fashion, there is 04:00, 04:30 - 05:00, 05:30
for row number 6, because there are -
for 04:15
and 05:15
.