
Using gsub when a string has multiple backslashes and/or special characters


I have a string from which I want to extract the city. In this example the values of interest are 'Elland Rd' and 'Leeds', with 'Leeds' being the city.

mystring = "0000\" club_info=\"Elland Rd, Leeds\" Pitch=\"100x50\""
city = gsub(".* club_info=\"(.*),(.+)\.*", "\\2", mystring) # can't get this part to work

My theory for getting the city is to match everything after the comma, up to the backslash, but I can't seem to get the pattern to recognize the backslash.


Solution

  • I prefer strcapture for extracting multiple patterns at once, rather than repeated calls to gsub. How about this?

    strcapture('.*club_info="([^"]+),([^"]+)".(.*)', mystring, list(x1="", x2="", x3=""))
    #          x1     x2             x3
    # 1 Elland Rd  Leeds Pitch="100x50"
    

    (It was not required to include the Pitch= in there, but I thought you might use it since it appears you're doing reductive gsubing.)
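
    On the backslashes themselves: as far as I can tell, the string in the question does not contain any. The \" sequences are only how R writes a double quote inside a double-quoted literal, so there is no backslash for the pattern to find. (Separately, "\." inside an R string literal is an unrecognized escape, so the pattern as posted would not even parse.) The sketch below checks this and shows one possible single-gsub pattern that stops at the closing double quote instead; the pattern is my own suggestion, not the only way to write it.

    ```r
    mystring <- "0000\" club_info=\"Elland Rd, Leeds\" Pitch=\"100x50\""

    # cat() prints the actual characters: the quotes are there, the backslashes are not
    cat(mystring, "\n")

    # Look for a literal backslash anywhere in the string
    has_backslash <- grepl("\\", mystring, fixed = TRUE)
    has_backslash
    # [1] FALSE

    # One gsub() call: skip up to the comma, swallow optional whitespace,
    # capture everything up to (but not including) the closing double quote
    city <- gsub('.*club_info="[^,"]*,\\s*([^"]+)".*', "\\1", mystring)
    city
    # [1] "Leeds"
    ```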

    FYI, x2 here has a leading space. It could be handled in the regex, but if you are not 100% positive the space is present in all cases, then it might be simpler to add trimws(.), as in

    strcapture('.*club_info="([^"]+),([^"]+)".(.*)', mystring, list(x1="", x2="", x3="")) |>
      lapply(trimws)
    # $x1
    # [1] "Elland Rd"
    # $x2
    # [1] "Leeds"
    # $x3
    # [1] "Pitch=\"100x50\""
    

    In this case the result drops from a data.frame to a list, but I'm not certain you need a frame; a named list should suffice. If you really want it as a frame (and many of my use-cases do prefer that), just add |> as.data.frame() to the pipe. Note that the native pipe |> requires R 4.1.0 or later.
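
    As a minimal sketch of that last step (assuming R 4.1.0 or later for the native pipe; the name out is just for illustration), the trimmed list can be turned back into a one-row frame like this:

    ```r
    mystring <- "0000\" club_info=\"Elland Rd, Leeds\" Pitch=\"100x50\""

    out <- strcapture('.*club_info="([^"]+),([^"]+)".(.*)', mystring,
                      list(x1 = "", x2 = "", x3 = "")) |>
      lapply(trimws) |>   # returns a named list of trimmed character vectors
      as.data.frame()     # back to a data.frame, one row per input string

    class(out)
    # [1] "data.frame"
    out$x2
    # [1] "Leeds"
    ```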

    Regex walk-through.

    .*club_info="([^"]+),([^"]+)".(.*)
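    ^^                                  leading text, discarded
      ^^^^^^^^^^^                       literal text
                  [^"]+   [^"]+         one or more "any character except dquote"
                 (     ),(     )        capture groups 1 and 2 (x1, x2)
                                ^^      closing dquote, then any one character (here a space)
                                  (.*)  capture group 3 (x3): the rest of the string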
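
    If you would rather absorb the space after the comma in the regex instead of calling trimws, one option (a sketch, assuming the separator is always a comma followed by zero or more plain spaces) is to put " *" right after the comma; res is just an illustrative name:

    ```r
    mystring <- "0000\" club_info=\"Elland Rd, Leeds\" Pitch=\"100x50\""

    # ", *" consumes the comma plus any spaces after it, so x2 starts at the city
    res <- strcapture('.*club_info="([^"]+), *([^"]+)".(.*)', mystring,
                      list(x1 = "", x2 = "", x3 = ""))

    res$x1
    # [1] "Elland Rd"
    res$x2
    # [1] "Leeds"
    ```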
    

    Also, since we know that the pattern contains double quotes and no single quotes, I chose single quotes as the outer string delimiters. If a pattern has both, or if you want to avoid double-backslashes and the like, we can use R's "raw strings" (available since R 4.0.0) instead,

    r"{.*club_info="([^"]+),([^"]+)".(.*)}"
    

    where the r"{ and }" are the open/close delimiters. I chose braces here since parens are visually confusing next to the regex parens, though brackets (r"[ and ]") and parens (r"( and )") also work.
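
    To make the raw-string point concrete, here is a small sketch (assuming R 4.0.0 or later; the variable names and the added \s* are mine, for illustration). It writes the same pattern both ways, and also writes the input string itself as a raw string, which shows again that the original contains quotes but no backslashes:

    ```r
    mystring <- "0000\" club_info=\"Elland Rd, Leeds\" Pitch=\"100x50\""

    # The same text as a raw string: no escaping of the inner double quotes needed
    mystring_raw <- r"(0000" club_info="Elland Rd, Leeds" Pitch="100x50")"
    identical(mystring, mystring_raw)
    # [1] TRUE

    # A pattern with a regex escape: \\s in a regular string, \s in a raw string
    pat_single <- '.*club_info="([^"]+),\\s*([^"]+)".(.*)'
    pat_raw    <- r"{.*club_info="([^"]+),\s*([^"]+)".(.*)}"
    identical(pat_single, pat_raw)
    # [1] TRUE

    res_raw <- strcapture(pat_raw, mystring, list(x1 = "", x2 = "", x3 = ""))
    res_raw$x2
    # [1] "Leeds"
    ```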