r string categories data-manipulation recode

Creating a categorical variable conditional on one specific value and its row indices in another variable


For simplicity and reproducibility, I use a simple data frame here:

set.seed(1234)
df <- data.frame(v1 = sample(c("A", "B", "C", "D", "E", "F"), 100, 
                 replace = TRUE, prob = c(0.1,0.2,0.2,0.2,0.2,0.1)))

My real data set contains several pages scraped from a PDF document. Imagine that "A" indicates that a new page begins. So, for example, all rows before the first "A" belong to the first page.

Using the following code, I can easily get the row indices where a new page begins:

page <- which(df$v1 == "A")

Result: an "A" appears in rows 14, 28, 39, 81 and 92.

To be able to group the data by page, I want to create a new variable that indicates the page number. So I want to assign a value of 1 to all rows with a row index < 14, a value of 2 to all rows with an index between 14 and 27, a value of 3 to all rows with an index between 28 and 38, and so on.

Of course, my data set is much larger than this example, so a simple ifelse() solution with hard-coded conditions is not practical. Furthermore, I want general code that I can apply to other data (scraped from other PDFs), which will have the "A"s at different positions.

I have already searched a lot online but could not find anything similar to my situation. I would be very grateful for any help, since I do not know how to approach this.

Thanks a lot in advance!


Solution

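  • You can use cumsum(). The comparison df$v1 == "A" is TRUE on every row that starts a new page, and cumsum() keeps a running count of those rows; adding 1 makes the numbering start at page 1: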

    df$page = cumsum(df$v1 == "A") + 1L
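
To see why this works, here is a minimal sketch on a small made-up vector (the names toy, starts, page_alt and rows_per_page are hypothetical and only used for illustration). It also shows what I believe is an equivalent formulation with findInterval(), built from the which() indices in the question, and a simple per-page count as one way to group by the new variable.

```r
# Small hypothetical vector; "A" marks the first row of a new page.
v1 <- c("B", "C", "A", "D", "E", "A", "A", "F")
toy <- data.frame(v1 = v1, stringsAsFactors = FALSE)

# TRUE/FALSE is summed as 1/0, so the running total counts the "A"s seen so far.
toy$page <- cumsum(toy$v1 == "A") + 1L

# Alternative built from the row indices returned by which():
# findInterval() counts how many page starts are <= each row index.
starts <- which(toy$v1 == "A")
page_alt <- findInterval(seq_len(nrow(toy)), starts) + 1L

# Group by page, e.g. count the rows on each page.
rows_per_page <- table(toy$page)
```

Note that both versions assume the first page has no "A" marker of its own. If the very first row of your data is already an "A", the numbering would start at 2, and you would probably want to drop the + 1L.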