Search code examples
pythonexcelt-sqlms-access

How can I get TSQL from easy MS Access SQL with little to no handiwork?


I have 500+ queries in a bunch of MS Access databases. The queries are rather easy.

  • (1.) I read them out with VBA into an Excel file as columns A to H, with the columns "ID, Datenbank, Objektname, LastUpdated, Objekttyp, Objektart, SourceTableName, Abfrage_SQL", with the last column being the query,
  • (2.) split the query with Regex into columns I to P as the split SQL blocks "Select, Into, From, Where, Group_By, Having, Order_By",
  • (3.) shortened the code with aliases into two new columns Q (New SQL Codes) and R (Mapping),
  • and my aim is to make (4.) a column S (TSQL) as the TSQL that I can take to feed an SSIS data flow.

For (3.), see In a standard MS Access SQL query output that does not have any aliases, how do I replace the full names by their "first-letters" aliases?, and there are also the links (1.)+(2.).

I need to change the MS Access VBA that is embedded in MS Access SQL, like Format() functions and special formats like #my_date#, and I do not want to Regex-replace the code dozens of times by hand in some Regex-Search-Replace menu. Instead, I want to loop over the replacements with Python, taking the output_file.xlsx from (3.) as the input and renaming the new output to output_file_tsql.xlsx. The output of the new TSQL code shall be put in a new column S (TSQL).

Which kind of Regex replacements might help as patterns for anyone to begin with? I am sure that the patterns in my 500+ queries are only a small share of what you can run into, but on the other hand, they should be a good sample for a cold start. You will have other patterns as well so that you cannot rely only on the examples. But then just answer and share what you found.

Is there a better way to get the TSQL from MS Access SQL than by Regex replacements? Any tool or trick can be an answer to this question.

How can I get TSQL from easy MS Access SQL with little to no handiwork?

(Old question was: "Which Regex replacements help when rewriting MS Access SQL queries as mere TSQL queries? How can these be looped over with Excel as input and output?")


Solution

  • This code is mainly about Regex replacements in a loop, which will give you TSQL-only code from "MS Access SQL(VBA)" queries for the given 500+ queries set. As said above, you will surely run into other patterns as well, this is just to help how the frame and the patterns can look like.

    import pandas as pd
    import re
    
    # Function to convert Access SQL to T-SQL
    def convert_access_to_tsql(sql_string):
        if pd.isnull(sql_string) or sql_string.strip() == "":
            return ""  # Skip empty cells
    
        # Replace commas in the SELECT clause with a newline before each comma
        def replace_commas_in_select(match):
            select_clause = match.group(1)  # Captures the SELECT line
            other_part = match.group(2)  # Captures the rest (INTO, FROM, WHERE)
    
            # Replace commas in the select_clause with newline before each comma
            select_clause = re.sub(r',', r'\r\n,', select_clause)
            
            return f"{select_clause} {other_part}"
            
        select_pattern = r'^(SELECT.*?)(INTO|FROM|WHERE|$)'
        sql_string = re.sub(select_pattern, replace_commas_in_select, sql_string, flags=re.MULTILINE)
    
        # List of regex patterns and replacements
        replacements = [
            (r'Format\(\[?(\w+)?\]?!\[?(\w+)\]?,\s*""(.+?)""\)', r'convert(varchar(10), \1.\2, <Format function for \3>)'),
            (r'\[?(\w+)\]!?\[?(\w+)\]?', r'\1.\2'),
            (r'AS \[(\w+)\]', r'AS \1'),
            (r'\bJOIN\s+\[?(\w+)?\]?!\[?(\w+)\]\s+ON', r'JOIN \1.\2 ON'),
            (r'\bIs\b', r'IS'),
            (r'\bNull\b', r'NULL'),
            (r'\bOr\b', r'OR'),
            (r'convert\(varchar\(\d+\), \[?(\w+)\]?!\?(\w+)\?\)', r'convert(datetime, \1.\2)'),
            (r'Like "(\w+)\*"', r"LIKE '\1%'"),  # Convert asterisks at the end to T-SQL form
            (r'Like "\*(\w+)"', r"LIKE '%\1'"),  # Convert asterisks at the beginning to T-SQL form
            (r'Like "\*(\w+)\*"', r"LIKE '%\1%'"),  # Convert asterisks at both ends to T-SQL form
            (r'#(\d{1,2})/(\d{1,2})/(\d{4})#', r"'\3-\1-\2'"),  # Correctly convert date format
            (r'Format\(\[?([\w|\.]+)\]?,\s*\"dd/mm/yyyy\"\)', r'convert(varchar(10), \1, 115)'),  # dd/mm/yyyy Japanese
            (r'Format\(\[?([\w|\.]+)\]?,\s*\"yyyy-mm-dd\"\)', r'convert(varchar(10), \1, 120)'),  # yyyy-mm-dd ODBC canonical
            (r'Format\(\[?([\w|\.]+)\]?,\"yyyy\"', r'convert(varchar(4), \1, 112)'),  # yyyy (from yyyymmdd)
            (r'Format\(\[?([\w|\.]+)\]?,\"dd\"', r'convert(varchar(2), \1, 104)'),  # dd
            (r'Date\(\)', r'CAST(GETDATE() AS Date)'),  # Replace VBA function
            (r'"\s*([^"]*)\s*"', r"'\1'"),  # Replace "Text" with 'Text'
            (r'!(?!\[)', '.'),  # Replace ! with . except after [
            (r'\s*(INTO|AND|ON)', r'\r\n\1'),  # Replace INTO, AND, ON with line breaks
            (r'\s*AND', r'\r\n    AND'),  # Four spaces in front of AND
            (r'\s*ON', r'\r\n    ON'),  # Four spaces in front of ON      
            (r'\s*(JOIN\s+\w*|\w+\s+JOIN\s+\w*)', r'\r\n\1'),  # All types of JOIN with line breaks
            (r'(?<!\w)\(\s*([a-zA-Z_][a-zA-Z0-9_.]*)\s*\)', r'\1'), # Remove parentheses (unless from a function) for simple dot notation (Step 1)
            (r'(?<!#)(\(\s*[^()]*![^()]*\))', r'1 = 1 --\1\n'),  # Comment out formular fields with exclamation marks, line break (Step 2)
            (r'\n\s*\n', r'\n'), # Drop line break (Step 3)
            (r'\bINTO\b', '--INTO'), # INTO blocks for nested queries in MS Access only
        ]
    
        # Loop through the replacements
        for pattern, replacement in replacements:
            try:
                sql_string = re.sub(pattern, replacement, sql_string, flags=re.MULTILINE)
            except re.error as e:
                print(f"Regex error: {e} in pattern: {pattern} for string: {sql_string}")
                raise 
    
        # Post-processing conversion of date strings to 'YYYY-MM-DD' format
        sql_string = re.sub(
            r"'(\d{1,4})-(\d{1,2})-(\d{1,2})'",
            lambda m: f"'{int(m.group(1)):04}-{int(m.group(2)):02}-{int(m.group(3)):02}'",
            sql_string
        )
    
        return sql_string
    
    # Load the Excel file and read the 'New SQL Codes' column
    input_file = 'output_file.xlsx'
    output_file = 'output_file_tsql.xlsx'
    
    # Read the data
    df = pd.read_excel(input_file)
    
    # Ensure that the 'New SQL Codes' column exists
    if 'New SQL Codes' not in df.columns:
        print("The column 'New SQL Codes' does not exist in the Excel file.")
    else:
        # Process each cell in the 'New SQL Codes' column and write the result to column 'TSQL'
        df['TSQL'] = df['New SQL Codes'].apply(convert_access_to_tsql)
    
        # Save the modified DataFrame as a new Excel file
        df.to_excel(output_file, index=False)
    
    print(f"The file has been successfully saved as {output_file}.")
    

    The output is the column S filled with 500+ TSQL queries without any MS Access functions and formats and with line breaks before each column and before some SQL keywords, AND and ON four spaces indented.

    With its output at hand, you can also sort it by the levels of dependencies with the help of How do I sort "SELECT INTO" queries that are built on top of each other by their "INTO" / "FROM" table links?.