Search code examples
regexeditpad

EditPad: Need a regex that handles multiple possible data formats


First, I'm using EditPadPro for my regex cleaning, so any answers given should work within that environment.

I get a large spreadsheet full of data that I have to clean every day. I've managed to get it down to a couple of different regexes that I run, and this works... but I'm curious to see if it's possible to reduce down to a single regex.

Here is some sample data:

3-CPC_114851_70095_70095_CAN-bre
3-CPC_114851_70095_70095_CAN
b11-ao1-113775-bre
b7-ao-114441
b7-ao-114441-bre
b7-ao1-114441
b7-ao1-114441-bre
http://go.nlvid.com/results1/?http://bo
go.nlv/results1/?click
b4-sm-1359
b6-sm-1356-bre
1359_195_1453814569-bre
1356_104_1456856729
b15-rad-8905
b15-rad-8905-bre

Here is how the above data needs to end up:

114851-bre
114851
113775-bre
114441
114441-bre
114441
114441-bre
http://go.nlvid.com/results1/
go.nlv/results1/
sm-1359
sm-1356-bre
sm-1359-bre
sm-1356
rad-8905
rad-8905-bre

So, there are numerous rules, such as:

  • In cases of more than 2 underscores, the result needs to contain only the value immediately after the first underscore, and everything from the dash onwards.
  • In cases where the string contains "-ao-", "-ao1-", everything prior to the final numeric string should be removed.
  • If a question mark is present, everything from the mark onwards should be removed.
  • If the string contains "-sm-" or "-rad-", everything prior to those alpha strings should be removed.
  • If the string contains 2 underscores, averything after the first numeric string up to a dash (if present) should be removed, and the string "sm-" should be prepended.

Additionally there is other data that must be left untouched, including but not limited to:

113535|24905|24905

as well as many variations on this pattern of xxxxxx|yyyyy|zzzzz (and not always those string lengths)

This may be asking way too much of regex, I'm not sure as I'm not great with it. But I've seen some pretty impressive things done with it, so I thought I'd put this out to the community and see what you come back with.


Solution

  • Jonathan, I can wrap all of those into one regex, except the last one (where you prepend sm- to a string that does not contain sm). It is not possible in this context, because we cannot capture "sm" to reuse in the replacement, and because there is no "conditional replacement" syntax in EPP.

    That being said, you can achieve what you want in EPP with two regexes and one macro to chain the two.

    Here is how.

    The solution below is tested in EPP.

    Regex 1

    1. Press Ctrl + Sh + F to enter Search / Replace mode
    2. Enter the following Search and Replace in the appropriate boxes
    3. At the top right of the Search bar, click the Favorite Searches pull-down, select "Add", give it a name, e.g. Regex 1

    Search:

    (?mx)^
    (?=(?:[^_\r\n]*?_){3})[^_\r\n]+?_([^_\r\n]+)[^-\r\n]+(-[^\r\n]+)?
    |
    [^\r\n]*?-ao1?-\D*([^\r\n]+)
    |
    ([^\r\n?]*)(?=\?)[^\r\n]+
    |
    [^\r\n]*?-((?:sm|rad)-[^\r\n]+)
    

    Replace:

    \1\2\3\4\5
    

    Regex 2

    Same 1-2-3 steps as above.

    Search

    ^(?!(?:[^_\r\n]*?_){3})(?=(?:[^_\r\n]*?_){2})(\d+)(?:[^-\r\n]+(-[^\r\n]+)?)
    

    Replace

    sm-\1\2
    

    Chaining Regex 1 and Regex 2

    1. Top menu: Macros, Record Macro, give it a name.
    2. Click the Favorite searches pulldown, select Regex 1
    3. Hit Replace All.
    4. Click the Favorite searches pulldown, select Regex 2
    5. Hit Replace All.
    6. Macros, Stop recording.
    7. Whenever you want to do your sequence of replacements, pull it by name under the Macros menu.

    Testing This

    I have tested my "Jonathan macro" on your input. Here is the result:

    114851-bre
    114851
    113775-bre
    114441
    114441-bre
    114441
    114441-bre
    http://go.nlvid.com/results1/
    go.nlv/results1/
    sm-1359
    sm-1356-bre
    sm-1359-bre
    sm-1356
    rad-8905
    rad-8905-bre