Search code examples
regextext-extraction

RegEx to find final occurance of a string in data


I've been looking around for a similar example of this, but I haven't been able to make anything work quite right for my use-case.

I have a block of data that deals with user titles, and includes historical data. For each user, depending on whether they've had a title change before, the data can have between 1 and maybe 10 occurrences, stacked up by date. I need the most recent title, which seems to always be the one on the bottom, and is identified by whatever follows the string "customBusinessTitle="

Here's an example of what the data looks like:

id=2039, employeeId=123, customEffectiveDate7=2018-02-22, customFINDepartment=Sales & Marketing, customBusinessTitle=Vice President, Communications, customTitleDetails=; id=2457, employeeId=123, customEffectiveDate7=2021-05-10, customFINDepartment=Communications, customBusinessTitle=Vice President, Communications, customTitleDetails=; id=5735, employeeId=123, customEffectiveDate7=2022-09-16, customFINDepartment=Communications, customBusinessTitle=Vice President, Communications, customTitleDetails=Vice President, Communications; id=7769, employeeId=123, customEffectiveDate7=2022-09-18, customFINDepartment=Communications, customBusinessTitle=Vice President, Communications, customTitleDetails=Vice President, Communications; id=9811, employeeId=123, customEffectiveDate7=2023-01-01, customFINDepartment=Communications, customBusinessTitle=Vice President, Communications, customTitleDetails=Vice President, Communications

My plan was to use the strings before and after the actual title information, to "bracket" what I need to extract - which in this case is "Vice President, Communications" as it appears between "customBusinessTitle=" and ", customTitleDetails", for the most recent (so final) occurrence in the data.

A couple of searches I've tried on https://regex101.com:

(?<=customBusinessTitle=)(.+)(?=, customTitleDetails)
  • This seems to return the first occurrence only - this is similar, but I need the most recent title by date, which seems to be at the bottom.
(?<=customBusinessTitle=)([^,]*)(?:(?!.*customBusinessTitle=).)*(?<!, customTitleDetails=)
  • I'm honestly a little confused about this one; it seems to return multiple occurrences of "Vice President" but cuts off ", Communications" (which I need), and then in the final occurrence, it matches everything from "customBusinessTitle" until the end of the line, which is MORE than I need. I'm including it only because it does something different with the last occurrence.
(?<=customBusinessTitle=)([^,]*)(?:(?!.*customBusinessTitle=).)(?<!, customTitleDetails=)
  • This one seems closest for what I need - it only returns the final occurrence, but it cuts off at the comma in the title, whereas I need it to capture the whole title.

I feel pretty close with that last one, but I can't quite see how to get it to include the whole title instead of only "Vice President" - Any advice?

Thanks for looking!!

FINAL EDIT: Both @Bohemian and @David found workable solutions for this - thank you both, I wish I could mark more than one answer!

Thanks to everyone else who offered comments and suggestions as well, I appreciate you!


Solution

  • (?<=customBusinessTitle=)(?!.*customBusinessTitle=).*?(?=$|;|(, \w+=))
    

    Some improvements to @bohemian’s excellent answer: the end of the match can be any new key (, \w+=), the end of a group (;) or the end of the string ($).

    • preceeded by customBusinessTitle= (positive lookbehind)
    • and not followed by customBusinessTitle= (negative lookahead)
    • match anything as few times as possible
    • and followed by line end ($) or group end (;) or next key (, \w+=). (positive lookahead)

    Live demo