Search code examples
rregex-groupstring-matching

regex R - Extract part of a string with variable formatting and content


I am collecting conference registration information. The form outputs each record in a string, with different products purchased (e.g., registration, workshops, field trips, etc.) all contained in a single string, with repeating structure but variable contents. I want to extract particular pieces of information for individual products from the records containing those products.

Example data:

x <- c("Workshop (May 8) (Amount: 55.00 USD, : Structured Decision Making) Field Trip (May 12) (Amount: 85.00 USD, : AK Wildlife Conservation Center) Total: 140.00 USD",
       "Workshop (May 8) (Amount: 55.00 USD, : Structured Decision Making) Total: 55.00 USD",
       "Conference Registration (Amount: 440.00 USD, : Regular) Total: 440.00 USD",
       "Conference Registration (Amount: 440.00 USD, : Regular) Workshop (May 8) (Amount: 55.00 USD, : TuktuTools and Earth Engine) Total: 495.00 USD",
       "Conference Registration (Amount: 440.00 USD, : Regular) Total: 440.00 USD",
       "Conference Registration (Amount: 440.00 USD, : Regular) Conference Support (Amount: 5.00 USD, Quantity: 10)  Discount: -489.95 Total: 0.05 USD  Coupon used: JFtest",
       "Conference Registration (Amount: 440.00 USD, : Regular) Workshop (May 8) (Amount: 40.00 USD, : Storytelling) Field Trip (May 12) (Amount: 65.00 USD, : Vegetation Walk) Banquet, Additonal Ticket(s) (Amount: 65.00 USD, Quantity: 1)  Discount: -609.94 Total: 0.06 USD  Coupon used: JFtest",
       "Conference Registration (Amount: 180.00 USD, : Single-day (May 10)) Workshop (May 8) (Amount: 55.00 USD, : Structured Decision Making)  Discount: -234.98 Total: 0.02 USD  Coupon used: JFtest",
       "Conference Registration (Amount: 180.00 USD, : Single-day (May 9))  Discount: -179.98 Total: 0.02 USD  Coupon used: JFtest",
       "Conference Registration (Amount: 440.00 USD, : Regular)  Discount: -440.00 Total: 0.00 USD  Coupon used: ARCTIC2023")

Pulling the total cost and the coupon code used (if any) works fine

as.numeric(str_extract(x, '(?<=Total: ).+(?= USD)'))

str_extract(x, '(?<= Coupon used: ).+')

Where I am really struggling is for elements of specific products that occur within parentheses. I want to extract particular pieces of information (e.g., registration type, workshop type) for records containing that product. In an ideal world I'd eventually like to have, for each product: price, product option, and quantity (if applicable, only for conference support and banquet tickets).

Desired outcome for registration type:

c(NA, NA, "Regular", "Regular",  "Regular", "Regular", "Regular", "Single-day (May 10)", "Single-day (May 9)", "Regular")

Desired outcome for workshop type:

c("Structured Decision Making", "Structured Decision Making", NA, "TuktuTools and Earth Engine", NA, NA, "Storytelling", "Structured Decision Making", NA, NA)

I tried using stringr functions for this, wrapped up in an if_else search for whether a given product is included, but this only works when registration is the only product.

library(stringr)
library(dplyr)
if_else(
  str_detect(string = x, pattern = 'Conference Registration'),
  str_extract(x, '(?<=USD, : ).+(?=\\) )'),
  NA_character_)

I also tried using gsub to search directly for the registration-relevant information, but am challenged by the variable nature of the string I am searching for, which sometimes contains just letters and other times includes dashes, numbers, and/or parentheses. This also may result in multiple matches per string, since multiple products have a "[product name] (Amount: [price] USD, : [product option])" format.

gsub(pattern = '(Conference Registration \\(Amount: \\d+\\.\\d{2} USD\\, \\: )(.+)(\\) \\w+.*)',
    replacement = '\\2',
    x = x)

I got closer by trying to hard code in all the variability I might see, but this is pretty cumbersome and it picked up some undesired elements (mismatches) and I used up all 9 of the gsub backreferences for replacement on my first two products, leaving me without a way to identify information for the other products.

pat <- '^(Conference Registration \\(Amount: )?(\\d+\\.\\d{2})?( USD, : )?(\\w+-?\\w*\\s?\\(?\\w*\\s?\\d*\\)?)?(\\)\\s)?(Workshop \\(May 8\\) \\(Amount: )?(\\d+\\.\\d{2})?( USD, : )?(\\w+\\s?\\w*\\s?\\w*\\s?\\w*)?(\\)\\s)?(Field Trip \\(May 12\\) \\(Amount: )?(\\d+\\.\\d{2})?( USD, : )?(\\w+\\s?\\w*\\s?\\w*\\s?\\w*)?(\\)\\s)?(.*)$'
gsub(pattern = pat, replacement = '\\2', x = test)
## That seems to work for registration price
gsub(pattern = pat, replacement = '\\4', x = test)
## That works for registration type
gsub(pattern = pat, replacement = '\\7', x = test)
## That works for workshop price
gsub(pattern = pat, replacement = '\\9', x = test)
## Mixed results. This pulls the workshop names, but when those do not exist it
## includes other parts of the string, like "Total" and "Conference Support".
## It would not be hard to filter those out after the fact, but this is not ideal.
gsub(pattern = pat, replacement = '\\12', x = test)
## Does not work as gsub can only have up to 9 replacement values

Any suggestions to extract data from the strings would be appreciated. I looked at a number of SO posts, including about extracting text in parentheses (e.g., https://stackoverflow.com/questions/28955367/extract-text-in-parentheses-in-r), but none seemed to cover the variable contents and matching formats in my case, where I only want information from some parentheses, not all.


Solution

  • We can use .+? instead of .+ to get the shortest match. For example,

    # long result
    str_extract('aa_1_aa_2222222_aa', 'aa_.+_aa')
    # [1] "aa_1_aa_2222222_aa"
    
    # shortest result
    str_extract('aa_1_aa_2222222_aa', 'aa_.+?_aa')
    # [1] "aa_1_aa"
    

    To extract the info for the toy data,

    library(stringr)
    
    # registration price
    str_extract(x, '(?<=Conference Registration \\(Amount: ).+?(?= USD)')
    # [1] NA       NA       "440.00" "440.00" "440.00" "440.00" "440.00" "180.00" "180.00" "440.00"
    
    # registration type
    str_extract(x, '(?<=Conference Registration .{1,100}?, : ).+?(?=\\) )')
    # [1] NA                    NA                    "Regular"             "Regular"             "Regular"             "Regular"             "Regular"            
    # [8] "Single-day (May 10)" "Single-day (May 9)"  "Regular"
    
    # workshop price
    str_extract(x, '(?<=Workshop .{1,100}?Amount: )[0-9\\.]+')
    # [1] "55.00" "55.00" NA      "55.00" NA      NA      "40.00" "55.00" NA      NA 
    
    # workshop type
    str_extract(x, '(?<=Workshop .{1,100}?, : ).+?(?=\\) )')
    # [1] "Structured Decision Making"  "Structured Decision Making"  NA                            "TuktuTools and Earth Engine" NA                           
    # [6] NA                            "Storytelling"                "Structured Decision Making"  NA                            NA