
Remove string data between double quotes to de-identify data/remove PHI in Notepad++


How can I remove the string data between double quotes in a pipe-delimited file? I would like to keep the file structure, and could replace the string data with "NULL" or a space. The file can have multiple rows with 150+ columns of alphanumeric data between the double quotes.

Sample Data (one row):"900085322P"|"Fred Flinstone"|"GUMBY"|" "|"83562315"|"00010101"|"M"|"FRED"|"9999999999"|"Address1"|" "|"city"|"state"|"zip"|"US"|"street"|" "|"city"|"State"|"Zipcode"|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|"021"|"Y"|"Y"|"FILLER"|""|""|""|""|""|""|""|"1538637137"|"003"|"20200601"|"20200630"|" "|"Address1"|" "|"City"|"State"|"Zip"|""

Expected output:

""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|""|"",etc...

Solution

  • Looking at your data, there are no escaped double quotes, so you can use a pattern that matches the text between two double quotes but never matches a pipe char (which keeps it from matching across fields):

    "\K[^"|]+(?=")
    

    Explanation

    • " Match a double quote
    • \K Forget what is matched so far
    • [^"|]+ Match 1+ chars other than a double quote or a pipe
    • (?=") Assert a double quote to the right

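    In Notepad++, put the pattern in the Find what field of the Replace dialog, set Search Mode to Regular expression, and leave Replace with empty (or enter NULL or a space).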
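Outside Notepad++, the same replacement can be scripted. This is a minimal Python sketch, not part of the original answer. Python's `re` module does not support `\K`, so the pattern here consumes both quotes and writes them back. The function name `blank_quoted_fields` and its `replacement` parameter are hypothetical.

```python
import re

# Same character class as the Notepad++ pattern. The surrounding quotes are
# matched and re-emitted because Python's re module has no \K.
QUOTED_FIELD = re.compile(r'"[^"|]+"')


def blank_quoted_fields(line, replacement=""):
    """Replace every non-empty quoted field with `replacement` in quotes."""
    # A function is used as the replacement so that `replacement` is
    # inserted literally, without backslash or group-reference processing.
    return QUOTED_FIELD.sub(lambda m: f'"{replacement}"', line)


row = '"900085322P"|"Fred Flinstone"|" "|""|"M"'
print(blank_quoted_fields(row))          # ""|""|""|""|""
print(blank_quoted_fields(row, "NULL"))  # "NULL"|"NULL"|"NULL"|""|"NULL"
```

Fields that are already empty (`""`) are left untouched, because the pattern requires at least one character between the quotes.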

    If there can be double quotes escaped with a backslash:

    "\K[^"|\\]*(?:\\.[^"|\\]*)*(?=")
    

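The escape-aware pattern can be sketched in Python the same way, again assuming the quotes are matched and re-emitted in place of `\K` and the lookahead. The name `blank_escaped_fields` and the sample row are made up for illustration.

```python
import re

# Ordinary characters, then any number of "backslash + one char" pairs,
# each followed by more ordinary characters, all between two quotes.
ESCAPE_AWARE_FIELD = re.compile(r'"[^"|\\]*(?:\\.[^"|\\]*)*"')


def blank_escaped_fields(line):
    """Empty every quoted field, treating \\" as part of the field."""
    return ESCAPE_AWARE_FIELD.sub('""', line)


row = r'"Fred \"Rocky\" Flinstone"|"GUMBY"|""'
print(blank_escaped_fields(row))  # ""|""|""
```

As in the Notepad++ version, an unescaped pipe is still excluded from the field content, so a match cannot run across a delimiter.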