
How to split strings between two specific characters (R)


I'm looking to split some scraped journal publication data neatly into columns (i.e. Author, Title, Journal etc.). I have done so for the most part, but I am stuck on the entry below, which has a newline (\n) in the middle of the title.

structure(list(value = "               What wrist should you wear your actigraphy device on? Analysis of dominant vs.\n            non-dominant wrist actigraphy for measuring sleep in healthy adults. \n                     Sleep Science. \n                        10:132-135.\n             2017\n\n                 Full text if available"), row.names = c(NA, 
-1L), class = c("tbl_df", "tbl", "data.frame"))

To work around this, instead of simply splitting at every \n, I'd like to split only where a \n is followed by a capital letter (so the title isn't split into two separate columns).

My original code to split at the \n line simply uses:

str_split_fixed(x,"\n", 2)[ ,2]

I've tried a number of combinations using regex lookahead/behind, but can't manage to figure out how to split between two characters and include those characters on either side.


Solution

  • You can use:

    strsplit(df$value, '\\n\\s+(?=[A-Z])', perl = TRUE)
    
    #[[1]]
    #[1] "               What wrist should you wear your actigraphy device on? Analysis of dominant vs.\n            non-dominant wrist actigraphy for measuring sleep in healthy adults. "
    #[2] "Sleep Science. \n                        10:132-135.\n             2017"                                                                                                         
    #[3] "Full text if available"                                                          
    

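    This splits the text at a newline followed by one or more whitespace characters, but only when a capital letter comes next. The capital letter sits inside a positive lookahead, (?=[A-Z]), so it is checked but not consumed and stays at the start of the next piece. The line break inside the title is followed by a lowercase letter ("non-dominant"), so no split happens there. perl = TRUE is needed because base R's default regex engine does not support lookaheads.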
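
The split pieces still carry the padding and the stray line break from the scraped text. A minimal sketch of one way to tidy them, assuming you want single-spaced, trimmed fields; the names x and parts are only for illustration:

```r
# Sample value copied from the question (one scraped record).
x <- "               What wrist should you wear your actigraphy device on? Analysis of dominant vs.\n            non-dominant wrist actigraphy for measuring sleep in healthy adults. \n                     Sleep Science. \n                        10:132-135.\n             2017\n\n                 Full text if available"

# Split at a newline plus whitespace, only where a capital letter follows.
parts <- strsplit(x, "\\n\\s+(?=[A-Z])", perl = TRUE)[[1]]

# Collapse every run of whitespace (including the newline left inside
# the title) to a single space, then trim both ends.
parts <- trimws(gsub("\\s+", " ", parts))

print(parts)
```

Since the question uses stringr, str_split(x, "\\n\\s+(?=[A-Z])") should give the same pieces, because stringr's ICU regex engine also supports lookaheads. Note that the pattern relies on the title's continuation line starting with a lowercase letter; an entry whose wrapped line begins with a capital would still be split there.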