I'm trying to split some scraped journal publication data neatly into columns (i.e. Author, Title, Journal, etc.). This works for most entries, but I'm stuck on the entry below, which has a newline (\n) in the middle of the title.
structure(list(value = " What wrist should you wear your actigraphy device on? Analysis of dominant vs.\n non-dominant wrist actigraphy for measuring sleep in healthy adults. \n Sleep Science. \n 10:132-135.\n 2017\n\n Full text if available"), row.names = c(NA,
-1L), class = c("tbl_df", "tbl", "data.frame"))
To work around this, instead of splitting at every \n, I'd like to split only where a \n is followed by a capital letter (so the title isn't split into two separate columns).
My original code, which splits at the first \n, is simply:
str_split_fixed(x,"\n", 2)[ ,2]
I've tried a number of combinations using regex lookahead/behind, but can't manage to figure out how to split between two characters and include those characters on either side.
You can use:
strsplit(df$value, '\\n\\s+(?=[A-Z])', perl = TRUE)
#[[1]]
#[1] " What wrist should you wear your actigraphy device on? Analysis of dominant vs.\n non-dominant wrist actigraphy for measuring sleep in healthy adults. "
#[2] "Sleep Science. \n 10:132-135.\n 2017"
#[3] "Full text if available"
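This splits the text at a newline character that is followed by one or more whitespace characters and then a capital letter. The newline and whitespace are consumed by the split; the capital letter is matched with a positive lookahead, which is zero-width, so it remains in the resulting string. Note that perl = TRUE is required for lookahead support in strsplit.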
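As a follow-up, here is a minimal sketch of how the split pieces might be tidied afterwards. It assumes the entry is held in a plain character vector (`x` is a stand-in for `df$value`), and the names `parts`, `title`, and `rest` are only illustrative.

```r
# Sample entry from the question, as a plain character vector (stand-in for df$value)
x <- " What wrist should you wear your actigraphy device on? Analysis of dominant vs.\n non-dominant wrist actigraphy for measuring sleep in healthy adults. \n Sleep Science. \n 10:132-135.\n 2017\n\n Full text if available"

# Split at newline + whitespace only when a capital letter follows
parts <- strsplit(x, "\\n\\s+(?=[A-Z])", perl = TRUE)[[1]]

# Collapse the remaining newline/whitespace runs to one space, then trim the ends
parts <- trimws(gsub("\\s*\\n\\s*", " ", parts, perl = TRUE))

title <- parts[1]  # full title, no embedded newline
rest  <- parts[2]  # journal, pages and year still together
```

One caveat: a title whose continuation line happens to start with a capital letter would still be split, so this is a heuristic rather than a guarantee. If you would rather stay in stringr, `str_split(x, "\\n\\s+(?=[A-Z])")` should give the same pieces, since its regex engine also supports lookahead.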