Regex pattern to count lines in poems that use either \n or \n\n as line breaks

I need to count the lines of 221 poems and tried counting the line breaks \n.

However, some lines end with a double line break \n\n to start a new verse. I want each of these counted only once. The number and position of double line breaks vary from poem to poem.

Minimal working example:

library("quanteda")

poem1 <- "This is a line\nThis is a line\n\nAnother line\n\nAnd another one\nThis is the last one"
poem2 <- "Some poetry\n\nMore poetic stuff\nAnother very poetic line\n\nThis is the last line of the poem"

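# corpus() expects a single character vector; a second positional argument would be read as docnames
poems <- quanteda::corpus(c(poem1, poem2))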

The resulting line count should be 5 lines for poem1 and 4 lines for poem2.

I tried stringi::stri_count_fixed(texts(poems), pattern = "\n"), but a fixed "\n" pattern counts every double line break twice, so it cannot handle the randomly placed double line breaks.

Solution

You can use stringr::str_count with the \R+ pattern to count the runs of consecutive line breaks in the string, so that a single \n and a double \n\n each count once:

> poem1 <- "This is a line\nThis is a line\n\nAnother line\n\nAnd another one\nThis is the last one"
> poem2 <- "Some poetry\n\nMore poetic stuff\nAnother very poetic line\n\nThis is the last line of the poem"
> library(stringr)
> str_count(poem1, "\\R+")
[1] 4
> str_count(poem2, "\\R+")
[1] 3

So the line count is str_count(x, "\\R+") + 1, provided the text does not start or end with a line break.

The \R pattern matches any line break sequence, such as CRLF, LF or CR. \R+ matches a run of one or more such line break sequences.
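Since str_count is vectorized, the same idea can be applied to all poems at once. Below is a minimal sketch, not part of the original answer: count_lines is a hypothetical helper name, and the trailing \n added to the second poem is an assumption made only to show why leading and trailing line breaks are stripped before counting.

```r
library(stringr)

# Hypothetical helper (an illustration, not from the original answer):
# counts the lines in each element of a character vector, treating any
# run of line breaks as a single break.
count_lines <- function(x) {
  # Drop line breaks at the very start or end so they do not add phantom lines
  x <- str_replace_all(x, "^\\R+|\\R+$", "")
  # An empty string has no lines; otherwise lines = runs of breaks + 1
  ifelse(x == "", 0L, str_count(x, "\\R+") + 1L)
}

poems_chr <- c(
  "This is a line\nThis is a line\n\nAnother line\n\nAnd another one\nThis is the last one",
  # Same as poem2, plus an assumed trailing line break
  "Some poetry\n\nMore poetic stuff\nAnother very poetic line\n\nThis is the last line of the poem\n"
)

count_lines(poems_chr)
# => [1] 5 4
```

For a quanteda corpus, the character vector would come from texts(poems) as in the question; newer quanteda versions may expect as.character(poems) instead, so check which one your installed version provides.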

The complete R demo:

poem1 <- "This is a line\nThis is a line\n\nAnother line\n\nAnd another one\nThis is the last one"
poem2 <- "Some poetry\n\nMore poetic stuff\nAnother very poetic line\n\nThis is the last line of the poem"
library(stringr)
str_count(poem1, "\\R+")
# => [1] 4
str_count(poem2, "\\R+")
# => [1] 3
## Line counts:
str_count(poem1, "\\R+") + 1
# => [1] 5
str_count(poem2, "\\R+") + 1
# => [1] 4
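If you would rather not load stringr, the same count can be sketched with stringi (the package used in the question) or with base R. This assumes the poems are held in a plain character vector; the variable names poems_chr, stringi_counts and base_counts are illustrative only.

```r
poems_chr <- c(
  "This is a line\nThis is a line\n\nAnother line\n\nAnd another one\nThis is the last one",
  "Some poetry\n\nMore poetic stuff\nAnother very poetic line\n\nThis is the last line of the poem"
)

# stringi: stri_count_regex uses the ICU regex engine, which supports \R
stringi_counts <- stringi::stri_count_regex(poems_chr, "\\R+") + 1
stringi_counts
# => [1] 5 4

# Base R: with perl = TRUE the PCRE engine also understands \R;
# gregexpr() finds every run of line breaks, lengths() counts them per poem
base_counts <- lengths(regmatches(poems_chr, gregexpr("\\R+", poems_chr, perl = TRUE))) + 1
base_counts
# => [1] 5 4
```

Both variants, like the stringr version, assume the text does not start or end with a line break.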