regex, notepad++

Extracting text with RegEx in Notepad++ confusion


So I have a large body of text that I need to extract some text from. This is a small snippet of what it looks like.

pool-4-thread-54]"Sheet1 :name=Wagenaar, Larry CSA Term (4-15-13ALT).doc; " :Error adding or updating document. 
pool-4-thread-56]"Sheet1 :name=Kelly Services - 2nd Amendment to CLSA (11-13-13ALT).doc; " :Error adding or updating document. 
pool-4-thread-38]"Sheet1 :name=New Zealand Pharmaceuticals CDA 072313.doc; " :Error adding or updating document. 

I am using the following RegEx to get what I want out of it:

(["'])(?:(?=(\\?))\2.)*?\1

I then looked into how to extract the text that matches the pattern, and everything I have read says to use Find and Replace in Notepad++ and to replace the RegEx with /1 or $1.

This doesn't make sense to me, though, because it just replaces the actual text the pattern found, so I lose what I actually want to keep. Am I misunderstanding what I am supposed to do?

So let's say I have the line

pool-4-thread-54]"Sheet1 :name=Wagenaar, Larry CSA Term (4-15-13ALT).doc; " :Error adding or updating document. 

I do a find using the RegEx pattern and get the result of

"Sheet1 :name=Wagenaar, Larry CSA Term (4-15-13ALT).doc; " 

If I then replace that with

/1

then that line just becomes

pool-4-thread-54] :Error adding or updating document. 

Any help is appreciated, thanks.


Solution

  • To remove all the surrounding text and keep just what you need, use

    ^.*((["'])(?:(?!\2).)*?\2).*
    

    And replace with the $1 backreference.

    Details:

    • ^ - start of a line
    • .* - zero or more chars other than line break chars, as many as possible
    • ((["'])(?:(?!\2).)*?\2) - Group 1: a " or ' (captured into Group 2), then zero or more chars other than line break chars, as few as possible, none of which may be equal to the quote captured in Group 2, and then the same quote char again (\2)
    • .* - the rest of the line.
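
To see what each group holds, here is a minimal Python sketch. It assumes the Group 2 reference in the pattern is the plain backreference \2, and it uses Python's re module rather than the engine Notepad++ uses, so treat it as an illustration of the group layout only.

```python
import re

# The answer's pattern as a Python raw string (assumes \2 is a backreference).
PATTERN = re.compile(r"""^.*((["'])(?:(?!\2).)*?\2).*""")

# First sample line from the question.
line = ('pool-4-thread-54]"Sheet1 :name=Wagenaar, Larry CSA Term '
        '(4-15-13ALT).doc; " :Error adding or updating document. ')

match = PATTERN.match(line)
quoted = match.group(1)      # Group 1: the whole quoted string, quotes included
quote_char = match.group(2)  # Group 2: only the opening quote character

print(repr(quoted))
print(repr(quote_char))
```

Group 1 is what the replacement keeps. Group 2 exists only so the pattern can require the same kind of quote at both ends.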

    I added ^.* (which matches the start of the line followed by zero or more characters other than a newline), then enclosed your pattern in another capturing group (added ( in front and ) after) so that this submatch can be referenced in the replacement pattern with the $1 (or \1) backreference, and then added .* to match the rest of the line.

    Note that the backreferences in your pattern had to be renumbered.
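
The whole-line replacement can be tried outside Notepad++ as well. The sketch below is a rough Python equivalent, not what Notepad++ runs internally. It assumes re.MULTILINE stands in for Notepad++ handling each line separately, and that \1 in Python's replacement string plays the role of $1.

```python
import re

# Match each full line, capturing only the quoted part in Group 1.
PATTERN = re.compile(r"""^.*((["'])(?:(?!\2).)*?\2).*""", re.MULTILINE)

# The three sample lines from the question.
log_text = (
    'pool-4-thread-54]"Sheet1 :name=Wagenaar, Larry CSA Term (4-15-13ALT).doc; " :Error adding or updating document. \n'
    'pool-4-thread-56]"Sheet1 :name=Kelly Services - 2nd Amendment to CLSA (11-13-13ALT).doc; " :Error adding or updating document. \n'
    'pool-4-thread-38]"Sheet1 :name=New Zealand Pharmaceuticals CDA 072313.doc; " :Error adding or updating document. \n'
)

# Replacing the whole line with Group 1 leaves only the quoted text.
extracted = PATTERN.sub(r"\1", log_text)
print(extracted)
```

Because the match covers the entire line, the replacement discards the prefix and the error message and keeps only what Group 1 captured. This is the difference from replacing only the quoted text itself.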

    If you need to also remove linebreaks, add \R? (or \R* to match zero or more, to remove all empty lines if any) at the end of my regex.
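
The effect of the trailing line-break part can be sketched in Python too, with one caveat: Python's re module has no \R, so the sketch substitutes (?:\r\n|\n|\r) as an approximation. The helper name build_pattern and the short sample text are made up for illustration, and the input is assumed to use \n line endings.

```python
import re

BASE = r"""^.*((["'])(?:(?!\2).)*?\2).*"""
LINE_BREAK = r"(?:\r\n|\n|\r)"  # rough stand-in for \R, which Python's re lacks

def build_pattern(quantifier):
    """Hypothetical helper: append the line-break part with '?' or '*'."""
    return re.compile(BASE + LINE_BREAK + quantifier, re.MULTILINE)

# Made-up sample with an empty line between two entries.
sample = 'a]"first" :x\n\nb]"second" :y\n'

one_break = build_pattern("?").sub(r"\1", sample)   # removes one break per match
all_breaks = build_pattern("*").sub(r"\1", sample)  # also swallows empty lines

print(repr(one_break))
print(repr(all_breaks))
```

With ? the empty line survives as a single line break between the two results, while with * the results are joined directly.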