Search code examples
regexpython-3.7

getting exact match for list of columns name from files


I have created a script where I read from a list of column names that need to be deleted and specified a list of directories to scan for the sql and python files to find a match using the python - but I used the python IN clause which returns not a specific match. Would like some assistance to do this using regex.

The column names in the sql and pythonfiles could start AND end with either a space, a single quote, or a comma. Appreciate it in advance.

Here is my code - I marked the two lines of code that need to be changed to regex

    import glob        

    # read the list of column_fields_to_delete.txt into a list
    columns_to_delete_file = 'column_fields_to_delete.txt'
    
    with open(columns_to_delete_file) as f:
        columns_file = [line.strip() for line in f]
    

    delete_columns_list = [column for column in columns_file]
    
    # specify directories from which to scan all sql files from
    directories = ['/users/xx/sql/**/*.sql',
                   '/users/xx/Python/**/*.py']
    
    output_lines = list()
    
    for directory in directories:
       for file in glob.glob(directory, recursive=True):
        try:
            with open(file, 'r') as f:
                contents = f.read()
            exception_columns = list()
            for column_name in delete_columns_list:
                if column_name in contents.upper():        #--------- this needs to be changed to re.findall()
                   exception_columns.append(column_name)   #--------- this may need to be modified as well 
    
            if exception_columns:
                print(f"{file} file contains exception columns {exception_columns}\n\n")
            
        except:
        pass

The expected output is to print each sql or python file that references any of the columns from the delete_column_list, followed by the actual columns with the exact matches.


Solution

  • The regex that you can use to achieve your purpose is:

    r'[\s,\'"]' + column_name + r'[\s,\'"]'
    

    along with re.findAll.

    I have made few adjustments to your code. The code looks somewhat like below:

    import glob
    import re        
    
    # read the list of column_fields_to_delete.txt into a list
    columns_to_delete_file = 'column_fields_to_delete.txt'
    
    with open(columns_to_delete_file) as f:
        columns_file = [line.strip() for line in f]
    
    delete_columns_list = [column for column in columns_file]
    
    # specify directory from which to scan all sql files from
    directory = '/users/xx/sql/**/*.sql'
    
    output_lines = list()
    
    for file in glob.glob(directory, recursive=True):
        try:
            with open(file, 'r') as f:
                contents = f.read()
            exception_columns = list()
            for column_name in delete_columns_list:
                matches = re.findall(r'[\s,\'"]' + column_name + r'[\s,\'"]', contents, re.IGNORECASE)
                if matches:
                    exception_columns.append(column_name)
    
            if exception_columns:
                print(f"{file} file contains exception columns {exception_columns}\n\n")
        except:
            pass
    

    Explanation for the above code modifications:

    • The regex pattern that I used matches any column name that is surrounded by whitespace, commas, single quotes, or double quotes.
    • The re.findall() function returns a list of all matches in the contents of the file.
    • I have also used re.IGNORECASE flag to make the search case-insensitive. You can change that as per your requirement.
    • If there are any matches:
      • append the column name to the exception_columns list, and then print the filename and the list of exception columns.