Tags: r, string, split

How can I split a string on whitespace, or on quotes if they're present?


Context: I have a list of keywords that sometimes consist of one word (e.g. poisson, normal, ...) and sometimes consist of several words, in which case they are enclosed in single quotes ('Two-way ANOVA', 'Generalized linear model', ...). All keywords are separated by whitespace in a single string.

Question: How can I extract each keyword from the list, accounting for the ones that are within single quotes?

Example:

What I have:

kw <- "poisson normal 'negative binomial' log-likelihood"

What I want:

c("poisson", "normal", "negative binomial", "log-likelihood")

Solution

  • We could use a regex "find all" approach here and match on the following pattern:

    '.*?'|\S+
    

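    The alternation first tries to match a single-quoted term (the lazy .*? stops at the next single quote) and, failing that, falls back to matching any run of non-whitespace characters. Note that matched quoted terms keep their surrounding quotes.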

    library(stringr)
    
    kw <- "poisson normal 'negative binomial' log-likelihood"
    output <- str_extract_all(kw, "'.*?'|\\S+")
    output
    
    [[1]]
    [1] "poisson"             "normal"              "'negative binomial'"
    [4] "log-likelihood"
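
The result above still carries the single quotes around 'negative binomial', whereas the desired vector has none. A minimal sketch of one way to finish the job, assuming stringr is available and that keywords never start or end with a literal apostrophe of their own: strip one quote from each end of every token. The names tokens and keywords are only illustrative.

```r
library(stringr)

kw <- "poisson normal 'negative binomial' log-likelihood"

# Same pattern as above: a single-quoted phrase, or a run of non-whitespace.
# [[1]] picks the matches for the first (and only) input string.
tokens <- str_extract_all(kw, "'.*?'|\\S+")[[1]]

# Drop a single quote at the very start or very end of each token.
# Quotes in the middle of a token are left untouched.
keywords <- str_remove_all(tokens, "^'|'$")
keywords
```

This should give a plain character vector of the four keywords without quotes. Anchoring the removal with ^ and $ is a deliberate choice: it avoids deleting apostrophes that appear inside a keyword.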