I am collecting conference registration information. The form outputs each record in a string, with different products purchased (e.g., registration, workshops, field trips, etc.) all contained in a single string, with repeating structure but variable contents. I want to extract particular pieces of information for individual products from the records containing those products.
Example data:
x <- c("Workshop (May 8) (Amount: 55.00 USD, : Structured Decision Making) Field Trip (May 12) (Amount: 85.00 USD, : AK Wildlife Conservation Center) Total: 140.00 USD",
"Workshop (May 8) (Amount: 55.00 USD, : Structured Decision Making) Total: 55.00 USD",
"Conference Registration (Amount: 440.00 USD, : Regular) Total: 440.00 USD",
"Conference Registration (Amount: 440.00 USD, : Regular) Workshop (May 8) (Amount: 55.00 USD, : TuktuTools and Earth Engine) Total: 495.00 USD",
"Conference Registration (Amount: 440.00 USD, : Regular) Total: 440.00 USD",
"Conference Registration (Amount: 440.00 USD, : Regular) Conference Support (Amount: 5.00 USD, Quantity: 10) Discount: -489.95 Total: 0.05 USD Coupon used: JFtest",
"Conference Registration (Amount: 440.00 USD, : Regular) Workshop (May 8) (Amount: 40.00 USD, : Storytelling) Field Trip (May 12) (Amount: 65.00 USD, : Vegetation Walk) Banquet, Additonal Ticket(s) (Amount: 65.00 USD, Quantity: 1) Discount: -609.94 Total: 0.06 USD Coupon used: JFtest",
"Conference Registration (Amount: 180.00 USD, : Single-day (May 10)) Workshop (May 8) (Amount: 55.00 USD, : Structured Decision Making) Discount: -234.98 Total: 0.02 USD Coupon used: JFtest",
"Conference Registration (Amount: 180.00 USD, : Single-day (May 9)) Discount: -179.98 Total: 0.02 USD Coupon used: JFtest",
"Conference Registration (Amount: 440.00 USD, : Regular) Discount: -440.00 Total: 0.00 USD Coupon used: ARCTIC2023")
Pulling the total cost and the coupon code used (if any) works fine
as.numeric(str_extract(x, '(?<=Total: ).+(?= USD)'))
str_extract(x, '(?<= Coupon used: ).+')
Where I am really struggling is for elements of specific products that occur within parentheses. I want to extract particular pieces of information (e.g., registration type, workshop type) for records containing that product. In an ideal world I'd eventually like to have, for each product: price, product option, and quantity (if applicable, only for conference support and banquet tickets).
Desired outcome for registration type:
c(NA, NA, "Regular", "Regular", "Regular", "Regular", "Regular", "Single-day (May 10)", "Single-day (May 9)", "Regular")
Desired outcome for workshop type:
c("Structured Decision Making", "Structured Decision Making", NA, "TuktuTools and Earth Engine", NA, NA, "Storytelling", "Structured Decision Making", NA, NA)
I tried using stringr
functions for this, wrapped up in an if_else
search for whether a given product is included, but this only works when registration is the only product.
library(stringr)
library(dplyr)
if_else(
str_detect(string = x, pattern = 'Conference Registration'),
str_extract(x, '(?<=USD, : ).+(?=\\) )'),
NA_character_)
I also tried using gsub
to search directly for the registration-relevant information, but am challenged by the variable nature of the string I am searching for, which sometimes contains just letters and other times includes dashes, numbers, and/or parentheses. This also may result in multiple matches per string, since multiple products have a "[product name] (Amount: [price] USD, : [product option])" format.
gsub(pattern = '(Conference Registration \\(Amount: \\d+\\.\\d{2} USD\\, \\: )(.+)(\\) \\w+.*)',
replacement = '\\2',
x = x)
I got closer by trying to hard code in all the variability I might see, but this is pretty cumbersome and it picked up some undesired elements (mismatches) and I used up all 9 of the gsub
backreferences for replacement on my first two products, leaving me without a way to identify information for the other products.
pat <- '^(Conference Registration \\(Amount: )?(\\d+\\.\\d{2})?( USD, : )?(\\w+-?\\w*\\s?\\(?\\w*\\s?\\d*\\)?)?(\\)\\s)?(Workshop \\(May 8\\) \\(Amount: )?(\\d+\\.\\d{2})?( USD, : )?(\\w+\\s?\\w*\\s?\\w*\\s?\\w*)?(\\)\\s)?(Field Trip \\(May 12\\) \\(Amount: )?(\\d+\\.\\d{2})?( USD, : )?(\\w+\\s?\\w*\\s?\\w*\\s?\\w*)?(\\)\\s)?(.*)$'
gsub(pattern = pat, replacement = '\\2', x = test)
## That seems to work for registration price
gsub(pattern = pat, replacement = '\\4', x = test)
## That works for registration type
gsub(pattern = pat, replacement = '\\7', x = test)
## That works for workshop price
gsub(pattern = pat, replacement = '\\9', x = test)
## Mixed results. This pulls the workshop names, but when those do not exist it
## includes other parts of the string, like "Total" and "Conference Support".
## It would not be hard to filter those out after the fact, but this is not ideal.
gsub(pattern = pat, replacement = '\\12', x = test)
## Does not work as gsub can only have up to 9 replacement values
Any suggestions to extract data from the strings would be appreciated. I looked at a number of SO posts, including about extracting text in parentheses (e.g., https://stackoverflow.com/questions/28955367/extract-text-in-parentheses-in-r), but none seemed to cover the variable contents and matching formats in my case, where I only want information from some parentheses, not all.
We can use .+?
instead of .+
to get the shortest match. For example,
# long result
str_extract('aa_1_aa_2222222_aa', 'aa_.+_aa')
# [1] "aa_1_aa_2222222_aa"
# shortest result
str_extract('aa_1_aa_2222222_aa', 'aa_.+?_aa')
# [1] "aa_1_aa"
To extract the info for the toy data,
library(stringr)
# registration price
str_extract(x, '(?<=Conference Registration \\(Amount: ).+?(?= USD)')
# [1] NA NA "440.00" "440.00" "440.00" "440.00" "440.00" "180.00" "180.00" "440.00"
# registration type
str_extract(x, '(?<=Conference Registration .{1,100}?, : ).+?(?=\\) )')
# [1] NA NA "Regular" "Regular" "Regular" "Regular" "Regular"
# [8] "Single-day (May 10)" "Single-day (May 9)" "Regular"
# workshop price
str_extract(x, '(?<=Workshop .{1,100}?Amount: )[0-9\\.]+')
# [1] "55.00" "55.00" NA "55.00" NA NA "40.00" "55.00" NA NA
# workshop type
str_extract(x, '(?<=Workshop .{1,100}?, : ).+?(?=\\) )')
# [1] "Structured Decision Making" "Structured Decision Making" NA "TuktuTools and Earth Engine" NA
# [6] NA "Storytelling" "Structured Decision Making" NA NA