R: Split a title phrase into sub-phrases of a given maximum length


I have an R data frame where the columns have names such as the following:

"Goods excluding food purchased from stores and energy\nLast = 1.8"

"Books and reading material (excluding textbooks)\nLast = 136.1"

"Spectator entertainment (excluding video and audio subscription services)\nLast = -13.5"

There are a large number of columns. I want to insert newline characters where necessary, between words, so that these names consist of parts that are no longer than some given maximum, say MaxLen=18. And I want the last part, starting with the word "Last", to be on a separate line. In the three examples, the desired output is:

"Goods excluding\nfood purchased\nfrom stores and\nenergy\nLast = 1.8"

"Books and reading\nmaterial\n(excluding\ntextbooks)\nLast = 136.1"

"Spectator\nentertainment\n(excluding video\nand audio\nsubscription\nservices)\nLast = -13.5"

I have been trying to accomplish this with strsplit(), but without success. The parentheses and '=' sign may be part of my problem. The "\nLast = " portion is the same for all names.

Any suggestions would be much appreciated.


Solution

  • The strwrap() function can help here, though you need to do a bit of work to keep the existing line breaks. Consider this option:

    input <- c("Goods excluding food purchased from stores and energy\nLast = 1.8",
    "Books and reading material (excluding textbooks)\nLast = 136.1",
    "Spectator entertainment (excluding video and audio subscription services)\nLast = -13.5")
    
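    # Split on the existing newlines, wrap each piece to width 18,
    # then rejoin all pieces of each name with "\n"
    strsplit(input, "\n") |>
      lapply(function(s) unlist(sapply(s, strwrap, 18))) |>
      sapply(paste, collapse="\n")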
    # [1] "Goods excluding\nfood purchased\nfrom stores and\nenergy\nLast = 1.8"                        
    # [2] "Books and reading\nmaterial\n(excluding\ntextbooks)\nLast = 136.1"                           
    # [3] "Spectator\nentertainment\n(excluding video\nand audio\nsubscription\nservices)\nLast = -13.5"
    

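    Here we split each name on its existing line breaks, wrap every piece to the maximum width, then paste the pieces back together with newlines. Because the "Last = ..." part is already a separate piece after the split, it always ends up on its own line.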
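
Since the question is about column names of a data frame, the same idea can be wrapped in a small helper and applied to names(df). This is only a sketch: the function name wrap_label, the argument max_len, and the toy data frame df are made up for illustration and are not part of the original answer.

```r
# Hypothetical helper (name and arguments are illustrative assumptions):
# re-wrap every existing line of each label with strwrap(), keeping the
# line breaks that are already there.
wrap_label <- function(x, max_len = 18) {
  pieces <- strsplit(x, "\n", fixed = TRUE)   # one character vector per label
  vapply(pieces, function(parts) {
    # wrap each existing line separately, then flatten to a single vector
    wrapped <- unlist(lapply(parts, strwrap, width = max_len), use.names = FALSE)
    paste(wrapped, collapse = "\n")
  }, character(1), USE.NAMES = FALSE)
}

# Toy data frame standing in for the real one (values are placeholders)
df <- data.frame(1.8, 136.1, -13.5)
names(df) <- c(
  "Goods excluding food purchased from stores and energy\nLast = 1.8",
  "Books and reading material (excluding textbooks)\nLast = 136.1",
  "Spectator entertainment (excluding video and audio subscription services)\nLast = -13.5"
)

# Rename all columns in one step
names(df) <- wrap_label(names(df), max_len = 18)
```

Using vapply() with character(1) instead of sapply() guarantees one string per column name, so the assignment to names(df) cannot silently change shape.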