I have a lot of queries from a bunch of MS Access databases. I read them out and split them into their SQL-blocks by the standard SQL keywords with:
Thus, you need to put the objects from the MS Access databases in an Excel file of that same format to go on with answering this question. You might say that this is too detailed, and yes, it is, but without doing so, you will not get the split of the FROM
-block from the rest of the query, which is needed to answer the question at all.
The columns of that input Excel file:
Or altogether:
ID Datenbank Objektname LastUpdated Objekttyp Objektart SourceTableName Abfrage_SQL Fehler Select Into From Where Group_By Having Order_By
New SQL Codes
Mapping
The main SQL query is in column H
(Abfrage_SQL
, "Abfrage" means "Query"), while the split into blocks ranges from columns J
to P
. In the code, you will need column H
and column L
as the input.
You might get to an answer without this Excel input file and with other code, but you will not get around splitting the Code with some Regex, and I did not want to reinvent the wheel, therefore check the link how to get there. The queries from the MS Access databases at hand do not have aliases.
Mind that queries in your MS Access databases might have aliases. If you have them all the time, you do not need this question. But if they are there only sometimes, you need to change the code of the answer.
I do not want to put the aliases with Regex or even by hand and in many steps. Instead, I want to run Python code on it that does it all in one go.
I need to replace the full table and view names with their standardized aliases. The alias shall be built with the first letters of each name that is split by "_" so that "my_random_table" becomes "mrt". If more than one full name is assigned to an abbreviation, the repeated abbreviation must be given an ascending number.
The full query from MS Access might look like this:
select my_random_table.* from my_random_table
This shall be shortened by aliases like this:
select mrt.* from my_random_table mrt
The Excel input file affords a list of 100+ queries in column H
and their FROM
-block in column L
.
In a standard MS Access SQL query output that does not have any aliases, how do I replace the full names by their "first-letters" aliases? This shall be done with Python code that is run on an Excel input file. This Excel input file can be built with the help of the links listed above.
The tags "excel" and "ms-access" are not the core of the question, they are not even needed. The answer can help in any other SQL settings as well. I make this clear since the answer takes up the output from MS Access and MS Excel, but you will get around that by re-writing the code for another software setting.
The Python code to get two new columns (New SQL Codes
and Mapping
) at at the right end of the input Excel file and output that as a new output file, with the first two filled queries (processed_count < 2
) as debugging printouts, is:
import pandas as pd
import re
# Define SQL keywords
SQL_KEYWORDS = {
'SELECT', 'FROM', 'WHERE', 'JOIN', 'INNER', 'OUTER', 'LEFT', 'RIGHT',
'ON', 'AND', 'OR', 'GROUP', 'BY', 'ORDER', 'DESC', 'ASC', 'HAVING',
'INSERT', 'UPDATE', 'DELETE', 'VALUES', 'SET', 'CREATE', 'DROP',
'ALTER', 'TABLE', 'VIEW', 'AS', 'DISTINCT', 'LIKE', 'IN', 'BETWEEN',
'EXISTS', 'NULL', 'IS', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END',
'UNION', 'ALL', 'LIMIT', 'OFFSET', 'SELECTED', 'EXPLAIN', 'WITH',
'INTO'
}
def is_valid_object_name(full_name):
"""Checks if the object name is a table or view."""
return full_name.upper() not in SQL_KEYWORDS
def generate_short_name(full_name, existing_aliases):
"""Generates a short name from the full table name and ensures it is unique."""
parts = re.split(r'[_-]+', full_name.strip('[]'))
short_name = ''.join(part[0] for part in parts if part) # Generate abbreviation
short_name = short_name.lower()
# Ensure uniqueness
if short_name in existing_aliases:
count = 2
new_short_name = f"{short_name}{count}"
while new_short_name in existing_aliases:
count += 1
new_short_name = f"{short_name}{count}"
short_name = new_short_name
existing_aliases.add(short_name)
return short_name
def process_queries(queries, from_clauses):
"""Processes SQL queries and replaces table names with aliases."""
all_output_queries = []
all_short_names = []
processed_count = 0 # To keep track of the number of processed queries
for query, from_clause in zip(queries, from_clauses):
if query: # Debugging for each filled query
if processed_count < 2: # Only for the first two filled queries
print("\n=== Debugging Output for Query ===")
print("Original Query:", query)
print("FROM Clause:", from_clause)
short_names = {}
existing_aliases = set() # Collection for already used aliases
output_query = query.strip()
found_tables = re.findall(r'(?:^(\w+)$|((?:\[)?\b\w+(?:[-_]\w+)?\b(?:\])?)(?=\s*(?:INNER|JOIN|LEFT|RIGHT|OUTER|ON|FROM)))', from_clause)
# found_tables = [name[0] if name[0] else name[1] if name[1] else '' for name in table_names] # Extracting only the first group
found_tables = [name[0] if name[0] else name[1] if name[1] else '' for name in found_tables]
if processed_count < 2: # Debug information for the first two queries
print("Found Tables:", found_tables) # Output of found tables
for full_name in found_tables:
if is_valid_object_name(full_name) and full_name not in short_names:
short_name = generate_short_name(full_name, existing_aliases)
short_names[full_name] = short_name
if processed_count < 2: # Debug information for the first two queries
print("Generated Aliases:", short_names) # Output of generated aliases
# Inserting aliases in the FROM block
for full_name, short_name in short_names.items():
if full_name.startswith('[') and full_name.endswith(']'):
# Remove the square brackets for the substitution
pattern = re.escape(full_name).replace(r'\[', r'\[\b').replace(r'\]', r'\b\]')
else:
pattern = r'\b' + re.escape(full_name) + r'\b'
from_clause = re.sub(
pattern,
f'{full_name} {short_name}', # Full name and alias
from_clause, count=1
)
if processed_count < 2: # Debug information only for the first two queries
print(f"Inserted: {full_name} with Alias {short_name}")
# Replacing table names with aliases in the SELECT query
for full_name, short_name in short_names.items():
if full_name.startswith('[') and full_name.endswith(']'):
# Remove the square brackets for the substitution
pattern = re.escape(full_name).replace(r'\[', r'\[\b').replace(r'\]', r'\b\]')
# if '[' in full_name: # Debug information for the first two queries
# print("pattern:", pattern) # Output of pattern
else:
pattern = r'\b' + re.escape(full_name) + r'\b'
output_query = re.sub(pattern, short_name, output_query)
if processed_count < 2: # Debug information for the first two queries
print("Query after Alias Replacements:", output_query)
# Inserting the FROM block into the complete query
output_query = re.sub(r'\bFROM\b.*', f'FROM {from_clause}', output_query, count=1)
if processed_count < 2: # Debug information for the first two queries
print("Query after Adding the FROM Clause:", output_query)
# Column names prefixed with table names should not be changed
for full_name, short_name in short_names.items():
if full_name.startswith('[') and full_name.endswith(']'):
# Remove the square brackets for the substitution
pattern = re.escape(full_name).replace(r'\[', r'\[\b').replace(r'\]', r'\b\]\.')
else:
pattern = r'\b' + re.escape(full_name) + r'\.'
output_query = re.sub(pattern, f'{short_name}.', output_query)
if processed_count < 2: # Debug information for the first two queries
print("Processed Query:", output_query)
all_output_queries.append(output_query) # Add the processed query
all_short_names.append(short_names) # Add the abbreviations
processed_count += 1 # Increment the processed count
return all_output_queries, all_short_names
# Read the Excel file
input_file = 'input_file.xlsx' # Path to the input file
output_file = 'output_file.xlsx' # Path to the output file
# Load the Excel file and read relevant data
df = pd.read_excel(input_file)
# Extract SQL queries from column H (index 7) and the FROM clause from column L (index 11)
raw_queries = df.iloc[:, 7].dropna().tolist() # Column H
from_clauses = df.iloc[:, 11].dropna().tolist() # Column L
# Process queries
processed_queries, abbreviations = process_queries(raw_queries, from_clauses)
# Add new SQL codes and mappings to DataFrame
df['New SQL Codes'] = ''
df['Mapping'] = ''
processed_index = 0 # Index for processed queries
for i in range(len(df)): # Loop through original queries
query = df.at[i, 'Abfrage_SQL']
if isinstance(query, str) and query: # Check
df.at[i, 'New SQL Codes'] = processed_queries[processed_index]
# Create mapping only for table names, excluding column names
mapping_str = ', '.join(f"{k}: {v}" for k, v in abbreviations[processed_index].items()
if is_valid_object_name(k) and not '.' in k) # Only use valid table names
df.at[i, 'Mapping'] = mapping_str.strip() # Prepare mapping
processed_index += 1 # Only increment for processed queries
# Save the new data to an Excel file
with pd.ExcelWriter(output_file, engine='openpyxl') as writer:
df.to_excel(writer, index=False, sheet_name='SQL_Queries')
print(f'Data has been saved in {output_file}.')
The New SQL Codes
column lists all new "aliased" SQL queries. The Mapping
column lists all tables and views with their aliases, so that:
SELECT my_random_table.*, my_raw_tbl.*
FROM my_random_table
INNER JOIN my_raw_tbl
ON my_random_table.id = my_raw_tbl.random_tbl_id
..becomes:
SELECT mrt.*, mrt2.*
FROM my_random_table mrt
INNER JOIN my_raw_tbl mrt2
ON mrt.id = mrt2.random_tbl_id
The code can also handle tables embedded in squared brackets, for example "[abc-cba]" will be understood as alias "ac".