r, regex, nlp, data-science, quanteda

Regex pattern to count lines in poems that use either \n or \n\n as line breaks


I need to count the lines of 221 poems and tried counting the line breaks \n.

However, some lines are followed by a double line break \n\n to start a new verse. I want each of these counted only once. The number and position of double line breaks vary from poem to poem.

Minimal working example:

library("quanteda")

poem1 <- "This is a line\nThis is a line\n\nAnother line\n\nAnd another one\nThis is the last one"
poem2 <- "Some poetry\n\nMore poetic stuff\nAnother very poetic line\n\nThis is the last line of the poem"

# Combine the texts into one character vector so each poem becomes its own document
poems <- quanteda::corpus(c(poem1, poem2))

The resulting line count should be 5 lines for poem1 and 4 lines for poem2.

I tried stringi::stri_count_fixed(texts(poems), pattern = "\n"), but a fixed pattern counts every \n separately, so it cannot account for the random double line breaks.


Solution

  • You can use stringr::str_count with the \R+ pattern to count the runs of consecutive line breaks in the string:

    > poem1 <- "This is a line\nThis is a line\n\nAnother line\n\nAnd another one\nThis is the last one"
    > poem2 <- "Some poetry\n\nMore poetic stuff\nAnother very poetic line\n\nThis is the last line of the poem"
    > library(stringr)
    > str_count(poem1, "\\R+")
    [1] 4
    > str_count(poem2, "\\R+")
    [1] 3
    

    So the line count is str_count(x, "\\R+") + 1, provided the text does not start or end with a line break.

    The \R pattern matches any single line break sequence, such as CRLF, LF or CR. \R+ matches a run of one or more such line break sequences, so \n and \n\n each count as one match.

    Complete R example:

    poem1 <- "This is a line\nThis is a line\n\nAnother line\n\nAnd another one\nThis is the last one"
    poem2 <- "Some poetry\n\nMore poetic stuff\nAnother very poetic line\n\nThis is the last line of the poem"
    library(stringr)
    str_count(poem1, "\\R+")
    # => [1] 4
    str_count(poem2, "\\R+")
    # => [1] 3
    ## Line counts:
    str_count(poem1, "\\R+") + 1
    # => [1] 5
    str_count(poem2, "\\R+") + 1
    # => [1] 4
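
The `+ 1` rule assumes a poem neither starts nor ends with a line break. As a minimal sketch that avoids that assumption, the lines can be counted directly in base R by splitting on `\R+` and keeping only non-empty pieces. Note that `count_lines` is a hypothetical helper name used here for illustration, not a stringr or quanteda function, and the sketch relies on PCRE's `\R` support via `perl = TRUE`:

```r
poem1 <- "This is a line\nThis is a line\n\nAnother line\n\nAnd another one\nThis is the last one"
poem2 <- "Some poetry\n\nMore poetic stuff\nAnother very poetic line\n\nThis is the last line of the poem"

# Hypothetical helper: count non-empty lines in each element of a character vector
count_lines <- function(x) {
  # Split every text on runs of one or more line breaks (PCRE \R)
  pieces <- strsplit(x, "\\R+", perl = TRUE)
  # Count the pieces that still contain something after trimming whitespace
  vapply(pieces, function(p) sum(nzchar(trimws(p))), integer(1), USE.NAMES = FALSE)
}

count_lines(c(poem1, poem2))
# => [1] 5 4

# Leading and trailing line breaks (and CRLF) do not inflate the count
count_lines("\nA\r\n\r\nB\n")
# => [1] 2
```

The function is vectorised, so it should also work on the texts of a corpus once they are extracted as a character vector, for example with `as.character(poems)` in recent quanteda versions.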