I have created a script where I read from a list of column names that need to be deleted and specified a list of directories to scan for the sql and python files to find a match using the python - but I used the python IN clause which returns not a specific match. Would like some assistance to do this using regex.
The column names in the sql and pythonfiles could start AND end with either a space, a single quote, or a comma. Appreciate it in advance.
Here is my code - I marked the two lines of code that need to be changed to regex
import glob
# read the list of column_fields_to_delete.txt into a list
columns_to_delete_file = 'column_fields_to_delete.txt'
with open(columns_to_delete_file) as f:
columns_file = [line.strip() for line in f]
delete_columns_list = [column for column in columns_file]
# specify directories from which to scan all sql files from
directories = ['/users/xx/sql/**/*.sql',
'/users/xx/Python/**/*.py']
output_lines = list()
for directory in directories:
for file in glob.glob(directory, recursive=True):
try:
with open(file, 'r') as f:
contents = f.read()
exception_columns = list()
for column_name in delete_columns_list:
if column_name in contents.upper(): #--------- this needs to be changed to re.findall()
exception_columns.append(column_name) #--------- this may need to be modified as well
if exception_columns:
print(f"{file} file contains exception columns {exception_columns}\n\n")
except:
pass
The expected output is to print each sql or python file that references any of the columns from the delete_column_list, followed by the actual columns with the exact matches.
The regex that you can use to achieve your purpose is:
r'[\s,\'"]' + column_name + r'[\s,\'"]'
along with re.findAll
.
I have made few adjustments to your code. The code looks somewhat like below:
import glob
import re
# read the list of column_fields_to_delete.txt into a list
columns_to_delete_file = 'column_fields_to_delete.txt'
with open(columns_to_delete_file) as f:
columns_file = [line.strip() for line in f]
delete_columns_list = [column for column in columns_file]
# specify directory from which to scan all sql files from
directory = '/users/xx/sql/**/*.sql'
output_lines = list()
for file in glob.glob(directory, recursive=True):
try:
with open(file, 'r') as f:
contents = f.read()
exception_columns = list()
for column_name in delete_columns_list:
matches = re.findall(r'[\s,\'"]' + column_name + r'[\s,\'"]', contents, re.IGNORECASE)
if matches:
exception_columns.append(column_name)
if exception_columns:
print(f"{file} file contains exception columns {exception_columns}\n\n")
except:
pass
Explanation for the above code modifications:
re.findall()
function returns a list of all matches in the contents of the file.re.IGNORECASE
flag to make the search case-insensitive. You can change that as per your requirement.