
R: Trimming a very long string to complete words, keeping the beginning and the end


Let's assume I have this dataframe:

df = data.frame(text = c("This is a very long sentence that I would like to trim because I might need to put it as a label somewhere",
                         "This is another very long sentence that I would also like to trim because I might need to put it as who knows what"),
                col2 = c("1234", "5678"))

Following this post, I was able to create a new column that contains the start of each sentence, cut at complete words, which works fine.

df$short_txt = sapply(strsplit(df$text, ' '), function(i) paste(i[cumsum(nchar(i)) <= 20], collapse = ' '))

> df$short_txt
[1] "This is a very long"  "This is another very"
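
To see why this works, it can help to look at the intermediate values for a single string. The sketch below is only an illustration; the names txt, words, running and short are made up here and are not part of the original code. Note that cumsum(nchar(i)) adds up letters only, so the spaces between words do not count against the 20.

```r
# One example sentence from the question, split into words.
txt   <- "This is a very long sentence that I would like to trim because I might need to put it as a label somewhere"
words <- strsplit(txt, " ", fixed = TRUE)[[1]]

# Running total of word lengths (spaces are not counted).
running <- cumsum(nchar(words))
head(running)
# [1]  4  6  7 11 15 23

# Keep the words whose running total is still within the 20-character budget.
short <- paste(words[running <= 20], collapse = " ")
short
# [1] "This is a very long"
nchar(short)
# [1] 19
```

Because spaces are ignored, the kept text can come out longer than 20 characters when it consists of many short words.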

However, I would also like to append the end of each sentence, again as complete words, starting about 20 characters before the end, to get something close to this output:

> df$short_txt
[1] "This is a very long...it as a label somewhere"  "This is another very...it as who knows what"

I can't figure out how to extend the sapply call to reach this outcome. I tried adding a second selection to the paste call and changing the cumsum condition, but it does not return what I want:

df$short_txt = sapply(strsplit(df$text, ' '), function(i) paste(i[cumsum(nchar(i)) <= 20],"...",i[cumsum(nchar(i)) >= (nchar(i)-20)], collapse = ' '))
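
A possible reason the attempt misbehaves, shown as a small sketch with made-up variable names (words, tail_keep): nchar(i) - 20 is computed per word, so it is negative for every word shorter than 20 characters and the second condition ends up selecting all words. paste() then recycles the shorter vector against the longer one element by element instead of joining the two halves.

```r
words <- strsplit("This is a very long sentence that I would like to trim", " ", fixed = TRUE)[[1]]

# Per-word lengths minus 20 are all negative here, so every word passes the test.
tail_keep <- cumsum(nchar(words)) >= (nchar(words) - 20)
all(tail_keep)
# [1] TRUE

# paste() works element-wise and recycles the shorter vector;
# collapse only joins the resulting pieces afterwards.
paste(c("a", "b"), "...", c("x", "y", "z"), collapse = " ")
# [1] "a ... x b ... y a ... z"
```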

Appreciate the help.


Solution

  • Perhaps we can regex this?

    gsub("^(.{20}\\S*)\\b.*\\b(\\S*.{20})$", "\\1...\\2", df$text)
    # [1] "This is a very long sentence...as a label somewhere" "This is another very...it as who knows what"        
    

    Regex explanation:

    ^(.{20}\\S*)\\b.*\\b(\\S*.{20})$
    ^                              $   beginning and end of string, respectively
     (.........)        (.........)    first and second saved groups
      .{20}                  .{20}     exactly 20 characters of any kind
           \\S*          \\S*          zero or more non-space characters
                \\b  \\b               word boundaries
                   .*                  anything else (including nothing)
    

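    In the first result, the trailing part does not include the it you wanted, because as a label somewhere is already exactly 20 characters long without it. The leading part runs on to sentence because the 20th character is the space after long, so \\S* then picks up the whole next word.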

    Here is df$text[1] with a range of lengths for the leading and trailing complete-word substrings.

    sapply(seq(10, 24, by = 2), function(len) gsub(sprintf("^(.{%d}\\S*)\\b.*\\b(\\S*.{%d})$", len, len), "\\1...\\2", df$text[1]))
    # [1] "This is a very... somewhere"                            
    # [2] "This is a very...label somewhere"                       
    # [3] "This is a very...label somewhere"                       
    # [4] "This is a very long... label somewhere"                 
    # [5] "This is a very long... a label somewhere"               
    # [6] "This is a very long sentence...as a label somewhere"    
    # [7] "This is a very long sentence...it as a label somewhere" 
    # [8] "This is a very long sentence... it as a label somewhere"
    

    I don't know off-hand how to prevent the spaces before/after the inserted ... within the regex itself, but they can be cleaned up in a second step (safe as long as your strings don't already contain "...").

    # Note: the native pipe |> with the _ placeholder requires R >= 4.2.0.
    sapply(seq(10, 24, by = 2), function(len) gsub(sprintf("^(.{%d}\\S*)\\b.*\\b(\\S*.{%d})$", len, len), "\\1...\\2", df$text[1])) |>
      sub(" *(\\.\\.\\.) *", "\\1", x = _)
    # [1] "This is a very...somewhere"                            
    # [2] "This is a very...label somewhere"                      
    # [3] "This is a very...label somewhere"                      
    # [4] "This is a very long...label somewhere"                 
    # [5] "This is a very long...a label somewhere"               
    # [6] "This is a very long sentence...as a label somewhere"   
    # [7] "This is a very long sentence...it as a label somewhere"
    # [8] "This is a very long sentence...it as a label somewhere"
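
If a regex feels too opaque, the cumsum idea from the question can also be applied from both ends. The following is a minimal sketch, not a drop-in replacement for the regex above; trim_middle and its arguments n and sep are hypothetical names introduced here. It counts letters only (as the question's code does), assumes words are separated by single spaces and that there are no NA values, and rejoins all words unchanged when the two halves would already cover the whole string.

```r
# Hypothetical helper: keep whole words from the start and from the end,
# joined by sep, with a budget of n letters on each side.
trim_middle <- function(x, n = 20, sep = "...") {
  vapply(strsplit(x, " ", fixed = TRUE), function(w) {
    len <- nchar(w)
    head_keep <- cumsum(len) <= n            # whole words counted from the start
    tail_keep <- rev(cumsum(rev(len)) <= n)  # whole words counted from the end
    # Nothing would be dropped: return all words rejoined.
    if (all(head_keep | tail_keep)) return(paste(w, collapse = " "))
    paste0(paste(w[head_keep], collapse = " "), sep,
           paste(w[tail_keep], collapse = " "))
  }, character(1))
}

txt <- c("This is a very long sentence that I would like to trim because I might need to put it as a label somewhere",
         "This is another very long sentence that I would also like to trim because I might need to put it as who knows what")
trim_middle(txt)
# [1] "This is a very long...it as a label somewhere"
# [2] "This is another very...put it as who knows what"
```

Since spaces are not counted, the second tail keeps put as well; counting nchar(w) + 1 per word instead would make the budget include the separating spaces.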