I am currently studying the Text Mining with R book by Silge and Robinson and, given my newbie status, I can't quite understand exactly how the regex "^chapter [\\divxlc]" works out the chapter numbers when tidying the texts. I have tried it on regex101 (though I may not know how to make that tool work for what I need). Can somebody help me figure it out? This is the code I am referring to:
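    library(dplyr)
    library(stringr)
    library(tidytext)
    library(janeaustenr)

    tidy_books <- austen_books() %>%
      group_by(book) %>%
      mutate(linenumber = row_number(),
             # running count of lines that look like chapter headings
             chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
                                                     ignore_case = TRUE)))) %>%
      ungroup() %>%
      unnest_tokens(word, text)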
My take on it is that this will also identify chapter numbers written in Roman numerals (`\d` would have sufficed for decimals, I think). Is that so? Is there a general formula to identify chapter numbers regardless of their numbering? If so, how would it identify chapters III, XXI, etc., where some Roman numerals repeat?
I would appreciate any pointer or reference that could clarify this.
Thanks in advance.
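The character class matches a single character from those listed between the square brackets: here a digit (`\d`) or one of the letters i, v, x, l, c, in either case because of ignore_case = TRUE. If the character after "chapter (space)" is a digit or one of those Roman numeral letters, you already have a match, and don't particularly care what it is followed by. You could add `+` after the class to say "one or more", but this doesn't change which lines are matched, and omitting it saves a few cycles.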
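
To see this in practice, here is a minimal sketch comparing the pattern with and without `+`. The sample lines are made up for illustration (they are not taken from the Austen texts), and the variable names are my own:

```r
library(stringr)

# Hypothetical sample lines, invented for this illustration
lines <- c(
  "CHAPTER I",
  "Some ordinary line of text",
  "Chapter 12",
  "CHAPTER XXI",
  "chapter and verse",
  "The next chapter 3"
)

# The pattern from the question, and a variant with "+" (one or more)
one_char <- regex("^chapter [\\divxlc]",  ignore_case = TRUE)
one_plus <- regex("^chapter [\\divxlc]+", ignore_case = TRUE)

hits_one  <- str_detect(lines, one_char)
hits_plus <- str_detect(lines, one_plus)

# Both patterns flag exactly the same lines
identical(hits_one, hits_plus)

# What differs is only how much text each match covers
str_extract(lines, one_char)  # stops after the first digit or numeral letter
str_extract(lines, one_plus)  # runs to the end of the digit/numeral run

# cumsum() turns the TRUE/FALSE flags into a running chapter counter
chapter <- cumsum(hits_one)
chapter
```

Since `str_detect()` only asks whether a line matches, the first character after "chapter " is enough to decide. Repeated numerals such as III or XXI therefore need no special handling. The `+` would only matter if you wanted to pull the full numeral out, for example with `str_extract()`.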