getting exact match for list of columns name from files

I have created a script where I read from a list of column names that need to be deleted and specified a list of directories to scan for the sql and python files to find a match using the python - but I used the python IN clause which returns not a specific match. Would like some assistance to do this using regex.

The column names in the sql and pythonfiles could start AND end with either a space, a single quote, or a comma. Appreciate it in advance.

Here is my code - I marked the two lines of code that need to be changed to regex

    import glob        

    # read the list of column_fields_to_delete.txt into a list
    columns_to_delete_file = 'column_fields_to_delete.txt'
    
    with open(columns_to_delete_file) as f:
        columns_file = [line.strip() for line in f]
    

    delete_columns_list = [column for column in columns_file]
    
    # specify directories from which to scan all sql files from
    directories = ['/users/xx/sql/**/*.sql',
                   '/users/xx/Python/**/*.py']
    
    output_lines = list()
    
    for directory in directories:
       for file in glob.glob(directory, recursive=True):
        try:
            with open(file, 'r') as f:
                contents = f.read()
            exception_columns = list()
            for column_name in delete_columns_list:
                if column_name in contents.upper():        #--------- this needs to be changed to re.findall()
                   exception_columns.append(column_name)   #--------- this may need to be modified as well 
    
            if exception_columns:
                print(f"{file} file contains exception columns {exception_columns}\n\n")
            
        except:
        pass

The expected output is to print each sql or python file that references any of the columns from the delete_column_list, followed by the actual columns with the exact matches.

Solution

The regex that you can use to achieve your purpose is:

r'[\s,\'"]' + column_name + r'[\s,\'"]'

along with re.findAll.

I have made few adjustments to your code. The code looks somewhat like below:

import glob
import re        

# read the list of column_fields_to_delete.txt into a list
columns_to_delete_file = 'column_fields_to_delete.txt'

with open(columns_to_delete_file) as f:
    columns_file = [line.strip() for line in f]

delete_columns_list = [column for column in columns_file]

# specify directory from which to scan all sql files from
directory = '/users/xx/sql/**/*.sql'

output_lines = list()

for file in glob.glob(directory, recursive=True):
    try:
        with open(file, 'r') as f:
            contents = f.read()
        exception_columns = list()
        for column_name in delete_columns_list:
            matches = re.findall(r'[\s,\'"]' + column_name + r'[\s,\'"]', contents, re.IGNORECASE)
            if matches:
                exception_columns.append(column_name)

        if exception_columns:
            print(f"{file} file contains exception columns {exception_columns}\n\n")
    except:
        pass

Explanation for the above code modifications:

The regex pattern that I used matches any column name that is surrounded by whitespace, commas, single quotes, or double quotes.
The re.findall() function returns a list of all matches in the contents of the file.
I have also used re.IGNORECASE flag to make the search case-insensitive. You can change that as per your requirement.
If there are any matches:
- append the column name to the exception_columns list, and then print the filename and the list of exception columns.