Search code examples
pythonregexnlpfindall

Finding the right word and row in the Financial Statement text file


I've used tesseract OCR in Python to convert the Financial statement pdfs to text files, while converting the long whitespaces into ";". So the text file looks pretty nice and the tables are looking good.

Using an example found here https://cdn.corporatefinanceinstitute.com/assets/AMZN-Cash-Flow.png

The table would be like the following:
Stock-based compensation;2,119;2,975;4,215
Other operating expense, net;155;160;202
Other expense (income), net;250;(20);(292)
Deferred income taxes;81;(246);(29)
...

OK, so the Task is to find the first sum after e.g "Stock-based compensation" -> 2,119. I've faced at least 3 issues with this:

1st problem is that I always have the whole financial statement pdf to start with, which contains for example 20 pages and can contain the word "Stock-based compensation" multiple times in sentences like "..is the date on which a stock-based compensation is granted...".

2nd problem is that to find the right table in the financial statement. There can be various smaller tables where "Stock-based compensation" can occur. However, in this case lets say we are looking for table called "Consolidated statements of cash flow", not for example "Estimated budget for next fiscal year" etc.

3rd problem is the word itself. "Stock-based compensation" can vary in different forms, like "Stock based compensation", "Compensation", "Stock based compensation & other compensation" etc. However, as we now that this "Stock based compensation" in a form or another shall be anyway in the right table, it shouldn't be a major issue to find the right line.

I've used for example regex to narrow down the options for the right word I'm looking for like this

def find_sum(word_to_look_for):
txt_file = r"fina.txt"
find = pattern
digitals = "\d+|;" #trying to find if any digits or ";" can be found on the row
with open(txt_file, "r") as in_text:
    for row in in_text:
        if re.findall(find, row, flags=re.IGNORECASE):
            if re.findall(digitals, row):
                expense_row = row.split(";")[1].strip("-")
                expenses = re.sub("[^\d\,]", "", expense_row) #if e.g 2.512,00
                return expenses
            else:
                pass

This solves some issues, but I'm currently thinking about whether I should implement ML or NLP tech into this case, or would this be easy enough to solve with regex, just narrowing down the possible lines with n-amount of if-statements?


Solution

  • Use words to look for with whitespace characters, regular expression can be create in the function with .* joiners to allow matching anything between those words.

    def find_sum(word_to_look_for):
        txt_file = r"fina.txt"
        find = ".*".join([re.escape(w) for w in word_to_look_for.split()]) + r"\D*(\d+(?:[,.]\d+)*)"
        with open(txt_file, "r") as in_text:
            for row in in_text:
                match = re.search(find, row, flags=re.IGNORECASE)
                if match:
                    return match.group(1)
        return ""
    

    Python code:

    import re
    
    def find_sum(word_to_look_for):
        txt_file = r"fina.txt"
        find = ".*".join([re.escape(w) for w in word_to_look_for.split()]) + r"\D*(\d+(?:[,.]\d+)*)"
        in_text = ['...', 'XXX', 'Stock-based compensation;2,234.55']
        #with open(txt_file, "r") as in_text:
        for row in in_text:
            match = re.search(find, row, flags=re.IGNORECASE)
            if match:
                return match.group(1)
        return ""
    
    wtlf = "Stock based compensation"
    print(find_sum(wtlf))
    

    Results: 2,234.55

    Regex:

    Stock.*based.*compensation\D*(\d+(?:[,.]\d+)*)
    

    EXPLANATION

    --------------------------------------------------------------------------------
      Stock                    'Stock'
    --------------------------------------------------------------------------------
      .*                       any character except \n (0 or more times
                               (matching the most amount possible))
    --------------------------------------------------------------------------------
      based                    'based'
    --------------------------------------------------------------------------------
      .*                       any character except \n (0 or more times
                               (matching the most amount possible))
    --------------------------------------------------------------------------------
      compensation             'compensation'
    --------------------------------------------------------------------------------
      \D*                      non-digits (all but 0-9) (0 or more times
                               (matching the most amount possible))
    --------------------------------------------------------------------------------
      (                        group and capture to \1:
    --------------------------------------------------------------------------------
        \d+                      digits (0-9) (1 or more times (matching
                                 the most amount possible))
    --------------------------------------------------------------------------------
        (?:                      group, but do not capture (0 or more
                                 times (matching the most amount
                                 possible)):
    --------------------------------------------------------------------------------
          [,.]                     any character of: ',', '.'
    --------------------------------------------------------------------------------
          \d+                      digits (0-9) (1 or more times
                                   (matching the most amount possible))
    --------------------------------------------------------------------------------
        )*                       end of grouping
    --------------------------------------------------------------------------------
      )                        end of \1