For simplicity and reproducibility, I use a simple data frame here:
set.seed(1234)
df <- data.frame(v1 = sample(c("A", "B", "C", "D", "E", "F"), 100,
                             replace = TRUE,
                             prob = c(0.1, 0.2, 0.2, 0.2, 0.2, 0.1)))
My real data set contains several pages scraped from a PDF document. Imagine that "A" indicates that a new page begins. So, for example, all rows before the first "A" belong to the first page.
Using the following code, I can easily get the row indices where a new page begins:
page <- which(df$v1 == "A")
Result: an "A" appears in rows 14, 28, 39, 81 and 92.
In order to be able to group the data by pages, I want to create a new variable which indicates the page number. So, I want to assign all rows with a row index < 14 a value of 1, all rows with an index between 14 and 27 a value of 2, all rows with an index between 28 and 38 a value of 3 and so on.
Of course, my data set is much larger than this example, so an ifelse() solution with hard-coded conditions is not practical. Furthermore, I want general code that I can apply to other data (scraped from other PDFs) in which the "A"s will be at different positions.
I have already searched a lot online but could not find anything similar to my situation. I would be very grateful if someone could help me, since I do not know how to approach this.
Thanks a lot in advance!
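You can use cumsum(). The comparison df$v1 == "A" is TRUE on every row where a new page begins, so its cumulative sum goes up by one at each "A"; adding 1 makes the rows before the first "A" page 1: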
df$page <- cumsum(df$v1 == "A") + 1L
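To see what this does without depending on the random sample, here is a minimal sketch on a small hand-written vector. The toy vector and the helper name add_page() are my own assumptions for illustration and are not part of the question's data; the helper only wraps the same cumsum() idea so it can be reused on other scraped files with a different marker.

```r
# Hypothetical toy input: "A" marks the first row of each new page.
v1 <- c("B", "C", "A", "D", "E", "A", "A", "F")

# Hypothetical helper: turn a marker column into page numbers.
# x == marker is TRUE at each page start; cumsum() counts the starts seen so far.
add_page <- function(x, marker = "A") {
  cumsum(x == marker) + 1L
}

page <- add_page(v1)
page
# [1] 1 1 2 2 2 3 4 4

# Split the rows by page number, e.g. for per-page processing.
pages <- split(v1, page)
pages[["2"]]
# [1] "A" "D" "E"
```

Note that if the marker column can contain NA, the comparison yields NA and cumsum() returns NA from that row onward; using x %in% marker instead of x == marker would treat NA as "not a page start".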