Tags: python, string, python-3.x, string-matching, edgar

String Matching in Python?


I'm having trouble matching strings in Python. I'm trying to look through the lines of documents like this and match each line against specific phrases. I read in all the lines, parse them with Beautiful Soup into stripped strings, then iterate through a list of all the lines in the document. From there, I use the following code to match the specific strings:

if row.upper() == ("AUDIT COMMITTEE REPORT" or "REPORT OF THE AUDIT COMMITTEE"):
    print("Found it!")
if "REPORT" in row.upper():
    print ("******"+row.upper()+"******")

When the code runs, I get the following output:

******COMPENSATION COMMITTEE REPORT******
******REPORT OF THE AUDIT COMMITTEE******
******REPORTING COMPLIANE******
******COMPENSATION COMMITTEE REPORT******
******REPORT OF THE AUDIT COMMITTEE******

The program never finds the phrase when the string is checked for equality, but when asked whether a portion of it is in the string, it finds it without trouble. How does string matching work in Python such that this happens, and how can I fix it so that it matches those exact phrases?

EDIT: Note that these documents are quite large, some easily exceeding 50 pages, so checking whether the phrase merely occurs somewhere in the row is not enough. It needs to be an exact match.


Solution

  • You could do something like this using a list comprehension.

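    row = '******AUDIT COMMITTEE REPORT******'
    match = ["AUDIT COMMITTEE REPORT", "REPORT OF THE AUDIT COMMITTEE"]
    # Count how many of the phrases occur in the upper-cased row.
    is_match = sum([m in row.upper() for m in match])

    # A non-zero count is truthy, so this fires when any phrase was found.
    if is_match:
        print("Found it!")
    if "REPORT" in row.upper():
        print("******" + row.upper() + "******")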
    

    First we create a list of all possible matches. These could be loaded from a file or declared statically in the Python code.

    match = ["AUDIT COMMITTEE REPORT", "REPORT OF THE AUDIT COMMITTEE"]
    

    Next we loop through all the possible matches and check whether each one occurs in the string row. Each check adds True or False to the list, and since True counts as 1, sum() gives the number of matches. We can use that to determine whether there was a match.

    is_match = sum([m in row.upper() for m in match])
    

    If you remove sum() you can see that the output of the list comprehension is simply a list of booleans.

    print([m in row.upper() for m in match])
    [True, False]
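
The list of booleans above can also be reduced with the built-in any() instead of sum(). This is only a small sketch of that alternative, not part of the original answer; the helper name contains_any is made up for illustration. With a generator expression, any() stops at the first True rather than building the whole list.

```python
matches = ["AUDIT COMMITTEE REPORT", "REPORT OF THE AUDIT COMMITTEE"]

def contains_any(row, phrases):
    # Hypothetical helper: True if any phrase occurs in the upper-cased row.
    upper_row = row.upper()
    # any() short-circuits: it stops evaluating at the first True.
    return any(phrase in upper_row for phrase in phrases)

row = "******Audit Committee Report******"
is_match = contains_any(row, matches)
print(is_match)
```

Like the sum() version, this is still a substring test, so a longer row that merely contains one of the phrases also counts as a match.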
    

    If you want something a little simpler and more efficient, you could implement a function with a for loop.

    matches = ["AUDIT COMMITTEE REPORT", "REPORT OF THE AUDIT COMMITTEE"]
    def is_match(row):
        for match in matches:
            if match in row.upper():
                return True
        return False
    

    This function loops through all possible matches. If it finds a match, it immediately returns True; otherwise it finishes the loop and returns False.
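
On why the equality check in the question never fires: in Python, `"A" or "B"` evaluates to the first truthy operand, so `("AUDIT COMMITTEE REPORT" or "REPORT OF THE AUDIT COMMITTEE")` is just `"AUDIT COMMITTEE REPORT"`, and the row is only ever compared against that one phrase. Since the question asks for an exact match rather than a substring test, one possible sketch is to test membership in a set of phrases. The names TARGETS, normalize and is_exact_match are made up for illustration, and the whitespace normalization is an assumption, because the question does not show what the raw rows look like.

```python
# Exact phrases to look for; a set gives fast membership tests.
TARGETS = {"AUDIT COMMITTEE REPORT", "REPORT OF THE AUDIT COMMITTEE"}

def normalize(row):
    # Assumption: rows may carry stray or repeated whitespace after parsing.
    # split() with no arguments splits on any run of whitespace.
    return " ".join(row.split()).upper()

def is_exact_match(row):
    # True only if the whole normalized row equals one of the phrases.
    return normalize(row) in TARGETS

# `or` returns its first truthy operand, which is why only one phrase was compared.
first_operand = "AUDIT COMMITTEE REPORT" or "REPORT OF THE AUDIT COMMITTEE"
print(first_operand)

for row in ["Report of the  Audit Committee ", "COMPENSATION COMMITTEE REPORT"]:
    if is_exact_match(row):
        print("Found it!")
```

If the rows are already clean, `row.upper() in TARGETS` alone does the same job.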