Search code examples
pythonsqlexcelms-access

In a standard MS Access SQL query output that does not have any aliases, how do I replace the full names by their "first-letters" aliases?


I have a lot of queries from a bunch of MS Access databases. I read them out and split them into their SQL-blocks by the standard SQL keywords with:

Thus, you need to put the objects from the MS Access databases in an Excel file of that same format to go on with answering this question. You might say that this is too detailed, and yes, it is, but without doing so, you will not get the split of the FROM-block from the rest of the query, which is needed to answer the question at all.

The columns of that input Excel file:

enter image description here

enter image description here

enter image description here

enter image description here

Or altogether:

enter image description here

ID Datenbank Objektname LastUpdated Objekttyp Objektart SourceTableName Abfrage_SQL Fehler Select Into From Where Group_By Having Order_By New SQL Codes Mapping

The main SQL query is in column H (Abfrage_SQL, "Abfrage" means "Query"), while the split into blocks ranges from columns J to P. In the code, you will need column H and column L as the input.

You might get to an answer without this Excel input file and with other code, but you will not get around splitting the Code with some Regex, and I did not want to reinvent the wheel, therefore check the link how to get there. The queries from the MS Access databases at hand do not have aliases.

Mind that queries in your MS Access databases might have aliases. If you have them all the time, you do not need this question. But if they are there only sometimes, you need to change the code of the answer.

Task

I do not want to put the aliases with Regex or even by hand and in many steps. Instead, I want to run Python code on it that does it all in one go.

I need to replace the full table and view names with their standardized aliases. The alias shall be built with the first letters of each name that is split by "_" so that "my_random_table" becomes "mrt". If more than one full name is assigned to an abbreviation, the repeated abbreviation must be given an ascending number.

The full query from MS Access might look like this:

select my_random_table.* from my_random_table

This shall be shortened by aliases like this:

select mrt.* from my_random_table mrt

The Excel input file affords a list of 100+ queries in column H and their FROM-block in column L.

In a standard MS Access SQL query output that does not have any aliases, how do I replace the full names by their "first-letters" aliases? This shall be done with Python code that is run on an Excel input file. This Excel input file can be built with the help of the links listed above.

PS

The tags "excel" and "ms-access" are not the core of the question, they are not even needed. The answer can help in any other SQL settings as well. I make this clear since the answer takes up the output from MS Access and MS Excel, but you will get around that by re-writing the code for another software setting.


Solution

  • The Python code to get two new columns (New SQL Codes and Mapping) at at the right end of the input Excel file and output that as a new output file, with the first two filled queries (processed_count < 2) as debugging printouts, is:

    import pandas as pd
    import re
    
    # Define SQL keywords
    SQL_KEYWORDS = {
        'SELECT', 'FROM', 'WHERE', 'JOIN', 'INNER', 'OUTER', 'LEFT', 'RIGHT', 
        'ON', 'AND', 'OR', 'GROUP', 'BY', 'ORDER', 'DESC', 'ASC', 'HAVING', 
        'INSERT', 'UPDATE', 'DELETE', 'VALUES', 'SET', 'CREATE', 'DROP', 
        'ALTER', 'TABLE', 'VIEW', 'AS', 'DISTINCT', 'LIKE', 'IN', 'BETWEEN', 
        'EXISTS', 'NULL', 'IS', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 
        'UNION', 'ALL', 'LIMIT', 'OFFSET', 'SELECTED', 'EXPLAIN', 'WITH', 
        'INTO'
    }
    
    def is_valid_object_name(full_name):
        """Checks if the object name is a table or view."""
        return full_name.upper() not in SQL_KEYWORDS
    
    def generate_short_name(full_name, existing_aliases):
        """Generates a short name from the full table name and ensures it is unique."""
        parts = re.split(r'[_-]+', full_name.strip('[]'))
        short_name = ''.join(part[0] for part in parts if part)  # Generate abbreviation
        short_name = short_name.lower()
    
        # Ensure uniqueness
        if short_name in existing_aliases:
            count = 2
            new_short_name = f"{short_name}{count}"
            while new_short_name in existing_aliases:
                count += 1
                new_short_name = f"{short_name}{count}"
            short_name = new_short_name
    
        existing_aliases.add(short_name)
        return short_name
    
    def process_queries(queries, from_clauses):
        """Processes SQL queries and replaces table names with aliases."""
        all_output_queries = []
        all_short_names = []
    
        processed_count = 0  # To keep track of the number of processed queries
    
        for query, from_clause in zip(queries, from_clauses):
            if query:  # Debugging for each filled query
                if processed_count < 2:  # Only for the first two filled queries
                    print("\n=== Debugging Output for Query ===")
                    print("Original Query:", query)
                    print("FROM Clause:", from_clause)
    
            short_names = {}
            existing_aliases = set()  # Collection for already used aliases
            output_query = query.strip()
            found_tables = re.findall(r'(?:^(\w+)$|((?:\[)?\b\w+(?:[-_]\w+)?\b(?:\])?)(?=\s*(?:INNER|JOIN|LEFT|RIGHT|OUTER|ON|FROM)))', from_clause)
    
            # found_tables = [name[0] if name[0] else name[1] if name[1] else '' for name in table_names]  # Extracting only the first group
            found_tables = [name[0] if name[0] else name[1] if name[1] else '' for name in found_tables]
    
            if processed_count < 2:  # Debug information for the first two queries
                print("Found Tables:", found_tables)  # Output of found tables
    
            for full_name in found_tables:
                if is_valid_object_name(full_name) and full_name not in short_names:
                    short_name = generate_short_name(full_name, existing_aliases)
                    short_names[full_name] = short_name
    
            if processed_count < 2:  # Debug information for the first two queries
                print("Generated Aliases:", short_names)  # Output of generated aliases
    
            # Inserting aliases in the FROM block
            for full_name, short_name in short_names.items():
                if full_name.startswith('[') and full_name.endswith(']'):
                    # Remove the square brackets for the substitution
                    pattern = re.escape(full_name).replace(r'\[', r'\[\b').replace(r'\]', r'\b\]')
                else:
                    pattern = r'\b' + re.escape(full_name) + r'\b'
    
                from_clause = re.sub(
                    pattern,
                    f'{full_name} {short_name}',  # Full name and alias
                    from_clause, count=1
                )
                
                if processed_count < 2:  # Debug information only for the first two queries
                    print(f"Inserted: {full_name} with Alias {short_name}")
    
            # Replacing table names with aliases in the SELECT query
            for full_name, short_name in short_names.items():
                if full_name.startswith('[') and full_name.endswith(']'):
                    # Remove the square brackets for the substitution
                    pattern = re.escape(full_name).replace(r'\[', r'\[\b').replace(r'\]', r'\b\]')
    
                    # if '[' in full_name:  # Debug information for the first two queries
                    #     print("pattern:", pattern)  # Output of pattern
                else:
                    pattern = r'\b' + re.escape(full_name) + r'\b'
                output_query = re.sub(pattern, short_name, output_query)
    
            if processed_count < 2:  # Debug information for the first two queries
                print("Query after Alias Replacements:", output_query)
    
            # Inserting the FROM block into the complete query
            output_query = re.sub(r'\bFROM\b.*', f'FROM {from_clause}', output_query, count=1)
    
            if processed_count < 2:  # Debug information for the first two queries
                print("Query after Adding the FROM Clause:", output_query)
    
            # Column names prefixed with table names should not be changed
            for full_name, short_name in short_names.items():
                if full_name.startswith('[') and full_name.endswith(']'):
                    # Remove the square brackets for the substitution
                    pattern = re.escape(full_name).replace(r'\[', r'\[\b').replace(r'\]', r'\b\]\.')
                else:
                    pattern = r'\b' + re.escape(full_name) + r'\.'
                output_query = re.sub(pattern, f'{short_name}.', output_query)
    
            if processed_count < 2:  # Debug information for the first two queries
                print("Processed Query:", output_query)
    
            all_output_queries.append(output_query)  # Add the processed query
            all_short_names.append(short_names)       # Add the abbreviations
            processed_count += 1  # Increment the processed count
    
        return all_output_queries, all_short_names
        
    # Read the Excel file
    input_file = 'input_file.xlsx'  # Path to the input file
    output_file = 'output_file.xlsx'  # Path to the output file
    
    # Load the Excel file and read relevant data
    df = pd.read_excel(input_file)
    
    # Extract SQL queries from column H (index 7) and the FROM clause from column L (index 11)
    raw_queries = df.iloc[:, 7].dropna().tolist()  # Column H
    from_clauses = df.iloc[:, 11].dropna().tolist()  # Column L
    
    # Process queries
    processed_queries, abbreviations = process_queries(raw_queries, from_clauses)
    
    # Add new SQL codes and mappings to DataFrame
    df['New SQL Codes'] = ''
    df['Mapping'] = ''
    
    processed_index = 0  # Index for processed queries
    for i in range(len(df)):  # Loop through original queries
        query = df.at[i, 'Abfrage_SQL']
        if isinstance(query, str) and query:  # Check
            df.at[i, 'New SQL Codes'] = processed_queries[processed_index]
    
            # Create mapping only for table names, excluding column names
            mapping_str = ', '.join(f"{k}: {v}" for k, v in abbreviations[processed_index].items()
                                     if is_valid_object_name(k) and not '.' in k)  # Only use valid table names
    
            df.at[i, 'Mapping'] = mapping_str.strip()  # Prepare mapping
    
            processed_index += 1  # Only increment for processed queries
    
    # Save the new data to an Excel file
    with pd.ExcelWriter(output_file, engine='openpyxl') as writer:
        df.to_excel(writer, index=False, sheet_name='SQL_Queries')
    
    print(f'Data has been saved in {output_file}.')
    

    The New SQL Codes column lists all new "aliased" SQL queries. The Mapping column lists all tables and views with their aliases, so that:

    SELECT my_random_table.*, my_raw_tbl.* 
    FROM my_random_table 
    INNER JOIN my_raw_tbl 
    ON my_random_table.id = my_raw_tbl.random_tbl_id
    

    ..becomes:

    SELECT mrt.*, mrt2.* 
    FROM my_random_table mrt 
    INNER JOIN my_raw_tbl mrt2 
    ON mrt.id = mrt2.random_tbl_id
    

    The code can also handle tables embedded in squared brackets, for example "[abc-cba]" will be understood as alias "ac".