
Get all the text between two newline characters (\n) in raw text using Python regex


I have several examples of raw text from which I need to extract the characters that follow 'Terms'. The common pattern I see is that the word 'Terms' is followed by a '\n', and the value I want ends with another '\n'. I want to extract all the characters (words, numbers, symbols) between these two \n, but only after the keyword 'Terms'.

Some examples of text are given below:

1) \nTERMS \nDirect deposit; Routing #256078514, acct. #160935\n\n'
2) \nTerms\nDue on receipt\nDue Date\n1/31/2021
3) \nTERMS: \nNET 30 DAYS\n

The code I have written is given below:

def get_term_regex(s):
    raw_text = s
    term_regex1 = r'(TERMS\s*\\n(.*?)\\n)'

    try:
        if ('TERMS' or 'Terms') in raw_text:
            
            pattern1 = re.search(term_regex1,raw_text)
            #print(pattern1)
            return pattern1
    except:
        pass

But I am not getting any output, as there is no match.

The expected output is:

1) Direct deposit; Routing #256078514, acct. #160935
2) Due on receipt
3) NET 30 DAYS

Any help would be really appreciated.


Solution

  • Try the following:

    import re
    
    text = '''1) \nTERMS \nDirect deposit; Routing #256078514, acct. #160935\n\n'
    2) \nTerms\nDue on receipt\nDue Date\n1/31/2021
    3) \nTERMS: \nNET 30 DAYS\n''' # \n are real new lines
    
    for m in re.finditer(r'(TERMS|Terms)\W*\n(.*?)\n', text):
        print(m.group(2))
    
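    1. Note that your regex could not deal with the third 'line' because there is a colon : after TERMS, so I replaced \s with \W. Also, inside a raw string \\n matches a literal backslash followed by the letter n, not a newline; since the \n in your text are real newlines, the pattern here uses \n, and it accepts both TERMS and Terms.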

    2. ('TERMS' or 'Terms') in raw_text might not be what you want. It does not raise a syntax error, but it is just the same as 'TERMS' in raw_text: when Python evaluates the parenthesized part, both 'TERMS' and 'Terms' are truthy, so or returns the first truthy value, i.e., 'TERMS'. The result is that 'Terms' is never checked by that part!

      So you might instead want something like ('TERMS' in raw_text) or ('Terms' in raw_text), although it is quite verbose.
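
As a less verbose alternative, the membership check can be dropped and the search made case-insensitive. The following is only a minimal sketch: the helper name extract_terms and the choice of re.IGNORECASE are my own assumptions, not something from the question, and it assumes the \n in the text are real newlines.

```python
import re

# `or` returns its first truthy operand, so only 'TERMS' would ever be tested.
assert ('TERMS' or 'Terms') == 'TERMS'

# Same pattern as above, but case-insensitive instead of listing spellings.
# Note: this also matches 'terms' inside longer words, which may or may not
# be acceptable for your data.
TERMS_RE = re.compile(r'terms\W*\n(.*?)\n', re.IGNORECASE)

def extract_terms(raw_text):
    """Hypothetical helper: return the line after each 'Terms' heading."""
    return [m.group(1) for m in TERMS_RE.finditer(raw_text)]

samples = [
    "\nTERMS \nDirect deposit; Routing #256078514, acct. #160935\n\n",
    "\nTerms\nDue on receipt\nDue Date\n1/31/2021",
    "\nTERMS: \nNET 30 DAYS\n",
]

for s in samples:
    print(extract_terms(s))
```

Returning a list (empty when nothing matches) avoids both the separate in check and the bare try/except, since finditer simply yields nothing when there is no match.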