How to Split Columns in R based on First Space


I have this code that splits the column on the last space, but I don't know how to modify it to split on the first space only. I'm not very familiar with regex.

library(tidyr)

df <- data.frame(Location = c("San Jose CA", "Fremont CA", "Santa Clara CA"))
separate(df, Location, into = c("city", "state"), sep = " (?=[^ ]+$)")

#          city state
# 1    San Jose    CA
# 2     Fremont    CA
# 3 Santa Clara    CA

Solution

  • You can use

    library(tidyr)
    df <- data.frame(Location = c("San Jose CA", "Fremont CA", "Santa Clara CA"))
    df_new <- separate(df, Location, into = c("city", "state"), sep = "^\\S*\\K\\s+")
    

    Output:

    > df_new
         city      state
    1     San    Jose CA
    2 Fremont         CA
    3   Santa   Clara CA
    

    The ^\S*\K\s+ regex matches

    • ^ - start of string
    • \S* - zero or more non-whitespace chars
    • \K - match reset operator: it discards the text matched so far from the overall match, so only the whitespace that follows is treated as the separator
    • \s+ - one or more whitespace chars.
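
To see what the pattern actually matches, here is a minimal base R sketch (not part of the original answer). It assumes the same three sample strings and uses sub() with perl = TRUE, since \K is a PCRE feature. The "|" replacement character is only an illustration that marks where the split would happen.

```r
# Sample strings from the question.
x <- c("San Jose CA", "Fremont CA", "Santa Clara CA")

# Same pattern as in the separate() call above.
pattern <- "^\\S*\\K\\s+"

# Replace whatever the pattern matches with "|".
# Because of \K, only the whitespace after the first word is replaced;
# the first word itself is kept.
marked <- sub(pattern, "|", x, perl = TRUE)
print(marked)
# Elements: "San|Jose CA", "Fremont|CA", "Santa|Clara CA"
```

Only the first run of whitespace is marked in each string, which is why separate() produces exactly two pieces.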

    NOTE: If your strings can have leading whitespace, the pattern above splits on that leading whitespace, because \S* can match an empty string at the start. To skip it, add \\s* right after ^, replace \\S* with \\S+, and use

    sep = "^\\s*\\S+\\K\\s+"
    

    Here, \S+ requires at least one non-whitespace char before the whitespace that the string is split on.
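
The difference between the two patterns can be checked with a small base R sketch. The input string with two leading spaces is a made-up example, and "|" again just marks the split point; this is an illustration, not the original answer's code.

```r
# Hypothetical input with two leading spaces.
x <- "  San Jose CA"

# First pattern: \S* matches nothing at the start, so the leading
# spaces themselves are matched and become the separator.
with_original <- sub("^\\S*\\K\\s+", "|", x, perl = TRUE)

# Pattern from the note: \s* skips the leading spaces, \S+ consumes
# the first word, and only the whitespace after it is matched.
with_note <- sub("^\\s*\\S+\\K\\s+", "|", x, perl = TRUE)

print(with_original)  # "|San Jose CA"
print(with_note)      # "  San|Jose CA"
```

Note that with the second pattern the leading spaces stay in the first piece. If that is not wanted, one option is to apply trimws() to the column before splitting.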