I need to count the lines in each of 221 poems, and tried doing so by counting the line breaks \n.
However, some poems use double line breaks \n\n to start a new verse. I want each of these counted as a single break. The number and position of the double line breaks vary from poem to poem.
Minimal working example:
library("quanteda")
poem1 <- "This is a line\nThis is a line\n\nAnother line\n\nAnd another one\nThis is the last one"
poem2 <- "Some poetry\n\nMore poetic stuff\nAnother very poetic line\n\nThis is the last line of the poem"
poems <- quanteda::corpus(c(poem1, poem2))
The resulting line count should be 5 lines for poem1 and 4 lines for poem2.
I tried stringi::stri_count_fixed(texts(poems), pattern = "\n"), but a fixed pattern counts every single line break, so each double line break is counted twice.
You can use stringr::str_count with the \R+ pattern to count the runs of consecutive line breaks in the string:
> poem1 <- "This is a line\nThis is a line\n\nAnother line\n\nAnd another one\nThis is the last one"
> poem2 <- "Some poetry\n\nMore poetic stuff\nAnother very poetic line\n\nThis is the last line of the poem"
> library(stringr)
> str_count(poem1, "\\R+")
[1] 4
> str_count(poem2, "\\R+")
[1] 3
So the line count is str_count(x, "\\R+") + 1.
The \R pattern matches any single line break sequence: CRLF, LF or CR. \R+ matches a run of one or more such line break sequences, so a double line break counts only once.
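To apply this to all poems at once, here is a minimal sketch. It assumes the poems are available as a plain character vector; the helper name count_poem_lines and the third sample text with Windows line endings are made up for illustration.

```r
library(stringr)

# Hypothetical helper: x is a character vector with one poem per element.
# Each run of line breaks (\R+) separates two lines, so lines = runs + 1.
count_poem_lines <- function(x) {
  str_count(x, "\\R+") + 1L
}

poems <- c(
  "This is a line\nThis is a line\n\nAnother line\n\nAnd another one\nThis is the last one",
  "Some poetry\n\nMore poetic stuff\nAnother very poetic line\n\nThis is the last line of the poem",
  "Windows line\r\n\r\nendings work too"  # made-up example with CRLF breaks
)

count_poem_lines(poems)
# => [1] 5 4 2
```

Note that the "+ 1" assumes a poem neither starts nor ends with a line break. If that can happen in your data, trimming first, e.g. count_poem_lines(str_trim(poems)), should avoid counting an extra line.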
See the R code demo:
poem1 <- "This is a line\nThis is a line\n\nAnother line\n\nAnd another one\nThis is the last one"
poem2 <- "Some poetry\n\nMore poetic stuff\nAnother very poetic line\n\nThis is the last line of the poem"
library(stringr)
str_count(poem1, "\\R+")
# => [1] 4
str_count(poem2, "\\R+")
# => [1] 3
## Line counts:
str_count(poem1, "\\R+") + 1
# => [1] 5
str_count(poem2, "\\R+") + 1
# => [1] 4
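If you would rather not depend on stringr, a different technique gives the same counts in base R: split each poem at every line break and count the pieces that are not blank. This is only a sketch under the assumption that a "line" means any piece containing at least one non-whitespace character; the function name count_nonblank_lines is hypothetical.

```r
# Hypothetical base R helper: x is a character vector with one poem per element.
count_nonblank_lines <- function(x) {
  # Split at CRLF, CR or LF; double line breaks produce empty pieces
  pieces <- strsplit(x, "\r\n|\r|\n")
  # Count only the pieces that still contain text after trimming whitespace
  vapply(pieces, function(p) sum(nzchar(trimws(p))), integer(1))
}

poem1 <- "This is a line\nThis is a line\n\nAnother line\n\nAnd another one\nThis is the last one"
poem2 <- "Some poetry\n\nMore poetic stuff\nAnother very poetic line\n\nThis is the last line of the poem"

count_nonblank_lines(c(poem1, poem2))
# => [1] 5 4
```

Because blank pieces are simply dropped, leading or trailing line breaks do not change the result with this approach.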