python regex python-re

How to remove a specific pattern from re.findall() results


I have a re.findall() call searching for a pattern in Python, but it returns some undesired results and I want to know how to exclude them. The text is below. I want to extract the names, but my statement (re.findall(r'([A-Z]{4,} \w. \w*|[A-Z]{4,} \w*)', text)) returns this:

 'ERIN E. SCHNEIDER',
 'MONIQUE C. WINKLER',
 'JASON M. HABERMEYER',
 'MARC D. KATZ',
 'JESSICA W. CHAN',
 'RAHUL KOLHATKAR',
 'TSPU or taken',
 'TSPU or the',
 'TSPU only',
 'TSPU was',
 'TSPU and']

I want to get rid of the "TSPU" pattern items. Does anyone know how to do it?

JINA L. CHOI (NY Bar No. 2699718)

ERIN E. SCHNEIDER (Cal. Bar No. 216114) schneidere@sec.gov

MONIQUE C. WINKLER (Cal. Bar No. 213031) winklerm@sec.gov

JASON M. HABERMEYER (Cal. Bar No. 226607) habermeyerj@sec.gov

MARC D. KATZ (Cal. Bar No. 189534) katzma@sec.gov

JESSICA W. CHAN (Cal. Bar No. 247669) chanjes@sec.gov

RAHUL KOLHATKAR (Cal. Bar No. 261781) kolhatkarr@sec.gov

  1. The Investor Solicitation Process Generally Included a Face-to-Face Meeting, a Technology Demonstration, and a Binder of Materials [...]

Solution

  • You can use the following pattern:

    \b(?!TSPU\b)[A-Z]{4,}(?:(?:\s+\w\.)?\s+\w+)?
    

    Details:

    • \b - a word boundary (otherwise, the regex may match part of a longer word that contains TSPU)
    • (?!TSPU\b) - a negative lookahead that fails the match if the string TSPU, followed by a non-word char or the end of the string, appears immediately to the right of the current location
    • [A-Z]{4,} - four or more uppercase ASCII letters
    • (?:(?:\s+\w\.)?\s+\w+)? - an optional occurrence of:
      • (?:\s+\w\.)? - an optional occurrence of one or more whitespace chars, a word char, and a literal . char
      • \s+ - one or more whitespace chars
      • \w+ - one or more word chars.
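The breakdown above can also be written directly into the pattern with re.VERBOSE, which ignores whitespace in the pattern and allows inline comments. This is only a sketch of the same regex laid out one component per line; the name NAME_RE is an assumption made for illustration.

```python
import re

# NAME_RE is a hypothetical name; the pattern is the one from the answer,
# split across lines so each component carries its own comment.
NAME_RE = re.compile(
    r"""
    \b                  # word boundary
    (?!TSPU\b)          # fail if the whole word TSPU starts here
    [A-Z]{4,}           # four or more uppercase ASCII letters
    (?:                 # optional tail:
        (?:\s+\w\.)?    #   optional whitespace, word char, literal dot
        \s+\w+          #   whitespace, then one or more word chars
    )?
    """,
    re.VERBOSE,
)

print(NAME_RE.findall("MARC D. KATZ (Cal. Bar No. 189534) katzma@sec.gov"))
# ['MARC D. KATZ']
print(NAME_RE.findall("TSPU or taken"))
# []
```

Because the lookahead ends in \b, only the exact word TSPU is rejected; a longer uppercase word that merely starts with TSPU would still be matched.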

    In Python, you can use:

    re.findall(r'\b(?!TSPU\b)[A-Z]{4,}(?:(?:\s+\w\.)?\s+\w+)?', text)
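As a quick check, here is a minimal, self-contained run of that call. The first three lines of the sample text are copied from the question; the last line is a made-up sentence (an assumption, since the full document is not shown) that contains TSPU the way the unwanted matches suggest.

```python
import re

# Three lines from the question, plus one hypothetical sentence
# containing "TSPU" to show that it is no longer matched.
text = """JINA L. CHOI (NY Bar No. 2699718)
ERIN E. SCHNEIDER (Cal. Bar No. 216114) schneidere@sec.gov
RAHUL KOLHATKAR (Cal. Bar No. 261781) kolhatkarr@sec.gov
The TSPU was shown to investors, and TSPU only."""

pattern = r'\b(?!TSPU\b)[A-Z]{4,}(?:(?:\s+\w\.)?\s+\w+)?'

# The pattern has no capturing groups, so findall returns whole matches.
names = re.findall(pattern, text)
print(names)
# ['JINA L. CHOI', 'ERIN E. SCHNEIDER', 'RAHUL KOLHATKAR']
```

Note that the pattern uses only non-capturing groups, which is why re.findall() returns the full matched strings rather than group contents.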