Search code examples
regexexcelbetweenwords

looking for 1-3 words preceded by 4 digit number followed by non alpha characters or 1-4 specific words


I am looking through a long list of data in excel. In this long list there is a column that has a mix of number characters (digits) and alpha characters, and non-alpha-numeric characters. – so a text column. I am trying to find 1-2 words in the middle of the cells in this column. The times these words appear are always after a four digit number and a space. (N.B 4 digit numbers do not appear before words I am not interested in).

I have a tool that lets me save lists of regular expressions in excel and have written a list of rules that will find the 4 digit numbers and two words after it provided they are between 3-11 characters each word, and then extract the match to a new column Sample : -

\b\d{4}\b\s*\b[A-Z]{11}\s*\b[A-Z]{11}\s*

\b\d{4}\b\s*\b[A-Z]{10}\s*\b[A-Z]{11}\s*

\b\d{4}\b\s*\b[A-Z]{9}\s*\b[A-Z]{11}\s*

\b\d{4}\b\s*\b[A-Z]{8}\s*\b[A-Z]{11}\s*

and

\b\d{4}\b\s*\b[A-Z]{11}\s*\b[A-Z]{10}\s*

\b\d{4}\b\s*\b[A-Z]{11}\s*\b[A-Z]{9}\s*

\b\d{4}\b\s*\b[A-Z]{11}\s*\b[A-Z]{8}\s*

b\d{4}\b\s*\b[A-Z]{11}\s*\b[A-Z]{7}\s*

This has created a list of 121 different combinations and works very well.
I am sure there was a simpler way of writing these than I’ve chosen but that is fine. Using copy and paste it took me about 10 minutes once I figured out how to write the rules, and takes 10 seconds to run the lot. –

The only problem is - in the instances where there is only one word I want to find there is sometimes 1 of 4 other alpha based words that appear after the first word that I do not want to include. These are : -

INTE
MANU
PMNT
Payment

What I want to do is include a section to my above regular expressions that excludes any second words where the above examples are found.

Is there a way to add something to my existing lists of rules that excludes these, or better yet something that will do what I want in a completely different and more elegant way as in very few cases there are actually 3 words i want to find - which to compensate for i'd need to up my list to 1331 combinations?

Any help in this would be wonderful as we get these exact lists of data every week and manually doing this is clearly impossible.


Solution

  • You can use the following regex equivalent to all your 121 patterns and your exclusion requirement:

    \b\d{4}\b\s*\b[A-Z]{3,11}\s*\b(?!(INTE|MANU|PMNT|Payment)\b)[A-Z]{3,11}\s*
                         ^^^^     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^       ^^^^
    

    See DEMO