Search code examples
pythonsqlexcelt-sqlms-access

How do I sort "SELECT INTO" queries that are built on top of each other by their "INTO" / "FROM" table links?


I need to migrate a dozen MS Access databases with 500+ queries to SSIS and therefore, I changed the code to TSQL, see How can I get TSQL from easy MS Access SQL with little to no handiwork?.

The queries in these projects are built on top of each other with materialised tables that you SELECT INTO MS Access so that you can fetch it in the next "FROM" block. There have been remarks that this should not be needed since MS Access does allow queries from scattered servers. And that the easy queries that are run here should not need the materialisation for performance. Yet, even if that is true, I cannot change the projects afterwards. They are as they are, and perhaps I am not the only one running into such a setting. Perhaps there are other good reasons for this setting. There are forms that allow you to filter some column and afterwards give out the filtered tables as a download with all the subquery steps that were needed. Therefore, the setting may still make sense even if it were not for performance or scattered servers.

How can I sort the TSQL queries one-dimensionally in an Excel file by their ancestry levels?

I need blocks that put together the query families inside a database, and on top of that, I would like to know the sort order for the queries inside these families.

The aim is to see at one sight how I should go on in SSIS to mirror the TSQL workflow. This is just a puzzle of dependencies, and it likely can be run on anything that has INTO and FROM in its SQL, with any tools and languages you can think of. I still flag this with MS Access, MS Excel, TSQL and Python, to narrow it down to my setting.

From the MS Excel input file that you can get from the link above, this can be done with just three columns as I know from self-answering the question, and getting the INTO block from the TSQL is such an easy Regex that you can quickly calculate it yourself without the link above:

  • Datenbank = MS Access database
  • Into = The "INTO" table cut out with Regex (see the link above)
  • TSQL = TSQL query that was made from the MS Access SQL

Example rows in the Excel file (first line for the column names):

Datenbank, INTO, TSQL
database1.accdb, tbl_INTO1, select 1 as test INTO tbl_INTO1
database1.accdb, tbl_INTO2, select test INTO tbl_INTO2 FROM tbl_INTO1
database1.accdb, tbl_INTO3, select test INTO tbl_INTO3 FROM tbl_INTO2
database1.accdb, tbl_INTO4, select test INTO tbl_INTO4 FROM tbl_INTO1
database1.accdb, tbl_INTO8, select 1 as test2 INTO tbl_INTO8
database1.accdb, tbl_INTO9, select test2 INTO tbl_INTO9 FROM tbl_INTO8

Clearly, tbl_INTO2 follows tbl_INTO1 and should therefore get a higher level in the tree of dependencies. But also tbl_INTO4 should be higher than tbl_INTO1. And tbl_INTO8 and tbl_INTO9 have nothing to do with the rest, they should be in their own family block.

This is about 500+ filled TSQL cells spread across a dozen MS access databases in that Excel file. How do I sort "SELECT INTO" queries that are built on top of each other by their "INTO" / "FROM" table links?


Solution

  • Here is the code that lets you sort by:

    1. "Datenbank" (database name)
    2. "Block" (query family with same ancestor)
    3. "Ancestor_Count" (level of the tree, but can count more than one parent ancestor)
    4. "Datenbank_ID" (database ID)

    Whether you sort it or not, this code gives you the "who is who" of each database's query families.

    import pandas as pd
    from collections import defaultdict
    
    # Step 1: Read the data
    input_file = 'output_file_tsql.xlsx'  # Input file
    df = pd.read_excel(input_file)
    
    # Step 2: Create a unique set of databases
    unique_databases = df['Datenbank'].unique() # English: database
    
    # Initialize a list for all results
    all_results = []
    
    def debug_database(database_name, debug=False):
        if debug:
            print(f"\n--- Debugging for Database: '{database_name}' ---")
    
        db_specific_df = df[df['Datenbank'] == database_name].copy()
        db_specific_df['Database_ID'] = range(len(db_specific_df))
    
        # Initialize columns
        db_specific_df['Block'] = ""
        db_specific_df['Descendants'] = ""
        db_specific_df['Descendant_Count'] = 0
        db_specific_df['Ancestors'] = ""
        db_specific_df['Ancestor_Count'] = 0
    
        def fill_descendants_and_ancestors():
            for idx, row in db_specific_df.iterrows():
                descendants = set()
                ancestors = set()
    
                def gather_descendants(table_name, visited):
                    if not table_name or not isinstance(table_name, str) or table_name in visited:
                        return
                    visited.add(table_name)
    
                    for i, r in db_specific_df.iterrows():
                        t_sql_str = str(r['TSQL']) if pd.notna(r['TSQL']) else ""
                        if table_name in t_sql_str:
                            target_inserted = str(r['Into']).strip() if pd.notna(r['Into']) else ""
                            if target_inserted and r['Database_ID'] != row['Database_ID']:
                                descendants.add(r['Database_ID'])
                                gather_descendants(target_inserted, visited)
    
                gather_descendants(str(row['Into']), set())
                db_specific_df.at[idx, 'Descendants'] = ', '.join(map(str, sorted(descendants)))
                db_specific_df.at[idx, 'Descendant_Count'] = len(descendants)
    
                for target_idx, target_row in db_specific_df.iterrows():
                    if str(row['Database_ID']) in str(target_row['Descendants']).split(', '):
                        ancestors.add(target_row['Database_ID'])
                db_specific_df.at[idx, 'Ancestors'] = ', '.join(map(str, sorted(ancestors)))
                db_specific_df.at[idx, 'Ancestor_Count'] = len(ancestors)
    
        fill_descendants_and_ancestors()
    
        def fill_blocks():
            for idx, row in db_specific_df.iterrows():
                block_ids = set()
                
                def collect_block_ids(current_ids, visited):
                    for current_id in current_ids:
                        if current_id in visited or current_id not in db_specific_df['Database_ID'].values:
                            continue
                        visited.add(current_id)
    
                        block_ids.add(current_id)
    
                        descendants = db_specific_df.loc[db_specific_df['Database_ID'] == current_id, 'Descendants'].values[0]
                        if isinstance(descendants, str):
                            descendant_ids = [int(n.strip()) for n in descendants.split(', ') if n.strip().isdigit()]
                            collect_block_ids(descendant_ids, visited)
    
                        ancestors = db_specific_df.loc[db_specific_df['Database_ID'] == current_id, 'Ancestors'].values[0]
                        if isinstance(ancestors, str):
                            ancestor_ids = [int(v.strip()) for v in ancestors.split(', ') if v.strip().isdigit()]
                            collect_block_ids(ancestor_ids, visited)
    
                collect_block_ids([row['Database_ID']], set())
                db_specific_df.at[idx, 'Block'] = ', '.join(map(str, sorted(block_ids)))
    
        fill_blocks()
    
        if debug:
            print("\n--- Results for the Database ---")
            for index, row in db_specific_df.iterrows():
                print(f"Database_ID: {row['Database_ID']} | Block: {row['Block']} | "
                      f"Descendants: {row['Descendants']} | Descendant_Count: {row['Descendant_Count']} | "
                      f"Ancestors: {row['Ancestors']} | Ancestor_Count: {row['Ancestor_Count']}")
    
        all_results.append(db_specific_df)
    
    for db_index, database in enumerate(unique_databases):
        debug = db_index < 2
        debug_database(database, debug=debug)
    
    final_df = pd.concat(all_results, ignore_index=True)
    
    # Append all original columns and the new ones
    final_columns = list(df.columns) + ['Database_ID', 'Ancestors', 'Ancestor_Count', 'Descendants', 'Descendant_Count', 'Block']
    final_df = final_df[final_columns]
    
    output_file_sorted = 'output_file_tsql_ancestry.xlsx'
    final_df.to_excel(output_file_sorted, index=False)
    
    print(f"The sorted output file for all databases has been saved: {output_file_sorted}")
    

    It leads to an example debugging output like this:

    --- Debugging for Database: 'some_db1.accdb' ---
    
    --- Results for the Database ---
    Database_ID: 0 | Block: 0 | Descendants:  | Descendant_Count: 0 | Ancestors:  | Ancestor_Count: 0
    Database_ID: 1 | Block: 1 | Descendants:  | Descendant_Count: 0 | Ancestors:  | Ancestor_Count: 0
    Database_ID: 2 | Block: 2 | Descendants:  | Descendant_Count: 0 | Ancestors:  | Ancestor_Count: 0
    Database_ID: 3 | Block: 3 | Descendants:  | Descendant_Count: 0 | Ancestors:  | Ancestor_Count: 0
    Database_ID: 4 | Block: 4 | Descendants:  | Descendant_Count: 0 | Ancestors:  | Ancestor_Count: 0
    Database_ID: 5 | Block: 5 | Descendants:  | Descendant_Count: 0 | Ancestors:  | Ancestor_Count: 0
    Database_ID: 6 | Block: 6 | Descendants:  | Descendant_Count: 0 | Ancestors:  | Ancestor_Count: 0
    Database_ID: 7 | Block: 7 | Descendants:  | Descendant_Count: 0 | Ancestors:  | Ancestor_Count: 0
    Database_ID: 8 | Block: 8 | Descendants:  | Descendant_Count: 0 | Ancestors:  | Ancestor_Count: 0
    Database_ID: 9 | Block: 9 | Descendants:  | Descendant_Count: 0 | Ancestors:  | Ancestor_Count: 0
    Database_ID: 10 | Block: 10 | Descendants:  | Descendant_Count: 0 | Ancestors:  | Ancestor_Count: 0
    Database_ID: 11 | Block: 11 | Descendants:  | Descendant_Count: 0 | Ancestors:  | Ancestor_Count: 0
    Database_ID: 12 | Block: 12 | Descendants:  | Descendant_Count: 0 | Ancestors:  | Ancestor_Count: 0
    Database_ID: 13 | Block: 13 | Descendants:  | Descendant_Count: 0 | Ancestors:  | Ancestor_Count: 0
    Database_ID: 14 | Block: 14, 15, 16, 17, 21, 22 | Descendants: 16, 17, 21, 22 | Descendant_Count: 4 | Ancestors:  | Ancestor_Count: 0
    Database_ID: 15 | Block: 14, 15, 16, 17, 21, 22 | Descendants: 16, 17, 21, 22 | Descendant_Count: 4 | Ancestors:  | Ancestor_Count: 0
    Database_ID: 16 | Block: 14, 15, 16, 17, 21, 22 | Descendants: 17, 21, 22 | Descendant_Count: 3 | Ancestors: 14, 15 | Ancestor_Count: 2
    Database_ID: 17 | Block: 14, 15, 16, 17, 21, 22 | Descendants: 21, 22 | Descendant_Count: 2 | Ancestors: 14, 15, 16 | Ancestor_Count: 3
    Database_ID: 18 | Block: 18 | Descendants:  | Descendant_Count: 0 | Ancestors:  | Ancestor_Count: 0
    Database_ID: 19 | Block: 19 | Descendants:  | Descendant_Count: 0 | Ancestors:  | Ancestor_Count: 0
    Database_ID: 20 | Block: 20 | Descendants:  | Descendant_Count: 0 | Ancestors:  | Ancestor_Count: 0
    Database_ID: 21 | Block: 14, 15, 16, 17, 21, 22 | Descendants:  | Descendant_Count: 0 | Ancestors: 14, 15, 16, 17 | Ancestor_Count: 4
    Database_ID: 22 | Block: 14, 15, 16, 17, 21, 22 | Descendants:  | Descendant_Count: 0 | Ancestors: 14, 15, 16, 17 | Ancestor_Count: 4
    
    --- Debugging for Database: 'some_db2.accdb' ---
    
    --- Results for the Database ---
    Database_ID: 0 | Block: 0 | Descendants:  | Descendant_Count: 0 | Ancestors:  | Ancestor_Count: 0
    Database_ID: 1 | Block: 1 | Descendants:  | Descendant_Count: 0 | Ancestors:  | Ancestor_Count: 0
    Database_ID: 2 | Block: 2 | Descendants:  | Descendant_Count: 0 | Ancestors:  | Ancestor_Count: 0
    Database_ID: 3 | Block: 3 | Descendants:  | Descendant_Count: 0 | Ancestors:  | Ancestor_Count: 0
    Database_ID: 4 | Block: 4 | Descendants:  | Descendant_Count: 0 | Ancestors:  | Ancestor_Count: 0
    Database_ID: 5 | Block: 5 | Descendants:  | Descendant_Count: 0 | Ancestors:  | Ancestor_Count: 0
    Database_ID: 6 | Block: 6 | Descendants:  | Descendant_Count: 0 | Ancestors:  | Ancestor_Count: 0
    Database_ID: 7 | Block: 7 | Descendants:  | Descendant_Count: 0 | Ancestors:  | Ancestor_Count: 0
    Database_ID: 8 | Block: 8 | Descendants:  | Descendant_Count: 0 | Ancestors:  | Ancestor_Count: 0
    Database_ID: 9 | Block: 9 | Descendants:  | Descendant_Count: 0 | Ancestors:  | Ancestor_Count: 0
    Database_ID: 10 | Block: 10 | Descendants:  | Descendant_Count: 0 | Ancestors:  | Ancestor_Count: 0
    Database_ID: 11 | Block: 11 | Descendants:  | Descendant_Count: 0 | Ancestors:  | Ancestor_Count: 0
    Database_ID: 12 | Block: 12 | Descendants:  | Descendant_Count: 0 | Ancestors:  | Ancestor_Count: 0
    Database_ID: 13 | Block: 13 | Descendants:  | Descendant_Count: 0 | Ancestors:  | Ancestor_Count: 0
    Database_ID: 14 | Block: 14 | Descendants:  | Descendant_Count: 0 | Ancestors:  | Ancestor_Count: 0
    Database_ID: 15 | Block: 15 | Descendants:  | Descendant_Count: 0 | Ancestors:  | Ancestor_Count: 0
    Database_ID: 16 | Block: 16 | Descendants:  | Descendant_Count: 0 | Ancestors:  | Ancestor_Count: 0
    Database_ID: 17 | Block: 17 | Descendants:  | Descendant_Count: 0 | Ancestors:  | Ancestor_Count: 0
    Database_ID: 18 | Block: 18 | Descendants:  | Descendant_Count: 0 | Ancestors:  | Ancestor_Count: 0
    Database_ID: 19 | Block: 19 | Descendants:  | Descendant_Count: 0 | Ancestors:  | Ancestor_Count: 0
    Database_ID: 20 | Block: 20 | Descendants:  | Descendant_Count: 0 | Ancestors:  | Ancestor_Count: 0
    Database_ID: 21 | Block: 21 | Descendants:  | Descendant_Count: 0 | Ancestors:  | Ancestor_Count: 0
    Database_ID: 22 | Block: 22 | Descendants:  | Descendant_Count: 0 | Ancestors:  | Ancestor_Count: 0
    Database_ID: 23 | Block: 23 | Descendants:  | Descendant_Count: 0 | Ancestors:  | Ancestor_Count: 0
    Database_ID: 24 | Block: 24 | Descendants:  | Descendant_Count: 0 | Ancestors:  | Ancestor_Count: 0
    Database_ID: 25 | Block: 25 | Descendants:  | Descendant_Count: 0 | Ancestors:  | Ancestor_Count: 0
    Database_ID: 26 | Block: 26 | Descendants:  | Descendant_Count: 0 | Ancestors:  | Ancestor_Count: 0
    Database_ID: 27 | Block: 27 | Descendants:  | Descendant_Count: 0 | Ancestors:  | Ancestor_Count: 0
    Database_ID: 28 | Block: 28 | Descendants:  | Descendant_Count: 0 | Ancestors:  | Ancestor_Count: 0
    Database_ID: 29 | Block: 29 | Descendants:  | Descendant_Count: 0 | Ancestors:  | Ancestor_Count: 0
    Database_ID: 30 | Block: 30 | Descendants:  | Descendant_Count: 0 | Ancestors:  | Ancestor_Count: 0
    Database_ID: 31 | Block: 31, 32, 33, 34 | Descendants: 32, 33, 34 | Descendant_Count: 3 | Ancestors:  | Ancestor_Count: 0
    Database_ID: 32 | Block: 31, 32, 33, 34 | Descendants:  | Descendant_Count: 0 | Ancestors: 31 | Ancestor_Count: 1
    Database_ID: 33 | Block: 31, 32, 33, 34 | Descendants:  | Descendant_Count: 0 | Ancestors: 31 | Ancestor_Count: 1
    Database_ID: 34 | Block: 31, 32, 33, 34 | Descendants:  | Descendant_Count: 0 | Ancestors: 31 | Ancestor_Count: 1
    Database_ID: 35 | Block: 35, 36, 37, 38 | Descendants: 36, 37, 38 | Descendant_Count: 3 | Ancestors:  | Ancestor_Count: 0
    Database_ID: 36 | Block: 35, 36, 37, 38 | Descendants:  | Descendant_Count: 0 | Ancestors: 35 | Ancestor_Count: 1
    Database_ID: 37 | Block: 35, 36, 37, 38 | Descendants:  | Descendant_Count: 0 | Ancestors: 35 | Ancestor_Count: 1
    Database_ID: 38 | Block: 35, 36, 37, 38 | Descendants:  | Descendant_Count: 0 | Ancestors: 35 | Ancestor_Count: 1
    Database_ID: 39 | Block: 39, 40, 41, 42 | Descendants: 40, 41, 42 | Descendant_Count: 3 | Ancestors:  | Ancestor_Count: 0
    Database_ID: 40 | Block: 39, 40, 41, 42 | Descendants:  | Descendant_Count: 0 | Ancestors: 39 | Ancestor_Count: 1
    Database_ID: 41 | Block: 39, 40, 41, 42 | Descendants:  | Descendant_Count: 0 | Ancestors: 39 | Ancestor_Count: 1
    Database_ID: 42 | Block: 39, 40, 41, 42 | Descendants:  | Descendant_Count: 0 | Ancestors: 39 | Ancestor_Count: 1
    Database_ID: 43 | Block: 43, 44, 45, 46 | Descendants: 44, 45, 46 | Descendant_Count: 3 | Ancestors:  | Ancestor_Count: 0
    Database_ID: 44 | Block: 43, 44, 45, 46 | Descendants:  | Descendant_Count: 0 | Ancestors: 43 | Ancestor_Count: 1
    Database_ID: 45 | Block: 43, 44, 45, 46 | Descendants:  | Descendant_Count: 0 | Ancestors: 43 | Ancestor_Count: 1
    Database_ID: 46 | Block: 43, 44, 45, 46 | Descendants:  | Descendant_Count: 0 | Ancestors: 43 | Ancestor_Count: 1
    Database_ID: 47 | Block: 47, 48, 49, 50 | Descendants: 48, 49, 50 | Descendant_Count: 3 | Ancestors:  | Ancestor_Count: 0
    Database_ID: 48 | Block: 47, 48, 49, 50 | Descendants:  | Descendant_Count: 0 | Ancestors: 47 | Ancestor_Count: 1
    Database_ID: 49 | Block: 47, 48, 49, 50 | Descendants:  | Descendant_Count: 0 | Ancestors: 47 | Ancestor_Count: 1
    Database_ID: 50 | Block: 47, 48, 49, 50 | Descendants:  | Descendant_Count: 0 | Ancestors: 47 | Ancestor_Count: 1
    Database_ID: 51 | Block: 51, 52, 53, 54 | Descendants: 52, 53, 54 | Descendant_Count: 3 | Ancestors:  | Ancestor_Count: 0
    Database_ID: 52 | Block: 51, 52, 53, 54 | Descendants:  | Descendant_Count: 0 | Ancestors: 51 | Ancestor_Count: 1
    Database_ID: 53 | Block: 51, 52, 53, 54 | Descendants:  | Descendant_Count: 0 | Ancestors: 51 | Ancestor_Count: 1
    Database_ID: 54 | Block: 51, 52, 53, 54 | Descendants:  | Descendant_Count: 0 | Ancestors: 51 | Ancestor_Count: 1
    Database_ID: 55 | Block: 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55 | Descendants: 32, 36, 40, 44, 48, 52 | Descendant_Count: 6 | Ancestors:  | Ancestor_Count: 0
    Database_ID: 56 | Block: 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 56 | Descendants: 33, 37, 41, 45, 49, 53 | Descendant_Count: 6 | Ancestors:  | Ancestor_Count: 0
    Database_ID: 57 | Block: 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 57 | Descendants: 34, 38, 42, 46, 50, 54 | Descendant_Count: 6 | Ancestors:  | Ancestor_Count: 0
    The sorted output file for all databases has been saved: output_file_tsql_ancestry.xlsx
    

    The output of the two debugged databases in Excel (chosen columns only):

    enter image description here

    In the database2.accdb output, you see that the nodes 55, 56 and 57 link to descendants of every other level 1 node (Ancestor_Count = 1), they are one of the two roots that each level 1 needs.

    Parent 1 (level 0):

    Database_ID: 55 | Block: 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55 | Descendants: 32, 36, 40, 44, 48, 52 | Descendant_Count: 6 | Ancestors:  | Ancestor_Count: 0
    

    Parent 2 (level 0):

    Database_ID: 31 | Block: 31, 32, 33, 34 | Descendants: 32, 33, 34 | Descendant_Count: 3 | Ancestors:  | Ancestor_Count: 0
    

    Child (level 1):

    Database_ID: 32 | Block: 31, 32, 33, 34 | Descendants:  | Descendant_Count: 0 | Ancestors: 31 | Ancestor_Count: 1
    

    The ancestors of 32 are 31 and 55, but 32 shows only 31 as its only ancestor. This turns out to be a lucky shot that I did not aim at but kept in the code. Node 55 does not make it there since 31 does not reach it in its descendants while 55 reaches 31 as an ancestor of 32 but has many other descendants that cannot be reached by 32's family so that the blocks are not merged.

    ancestors.add() is not reached for 55 as the ancestor of 32:

                for target_idx, target_row in db_specific_df.iterrows():
                    if str(row['Database_ID']) in str(target_row['Descendants']).split(', '):
                        ancestors.add(target_row['Database_ID'])
                    if str(row['Database_ID']) in str(target_row['Descendants']).split(', '):
                        ancestors.add(target_row['Database_ID'])
                db_specific_df.at[idx, 'Ancestors'] = ', '.join(map(str, sorted(ancestors)))
    

    And the block does not merge 55 with 31, 32, 33, 34 either:

    fill_blocks() -> collect_block_ids(ancestor_ids, visited) recursion call:

                        ancestors = db_specific_df.loc[db_specific_df['Database_ID'] == current_id, 'Ancestors'].values[0]
                        if isinstance(ancestors, str):
                            ancestor_ids = [int(v.strip()) for v in ancestors.split(', ') if v.strip().isdigit()]
                            collect_block_ids(ancestor_ids, visited)
    

    Even though the ancestor 55 is lost for 32, it is an even bigger win to have the large block of 55 that shows the reader at first sight that this must be a root of many descendants (32, 36, 40, 44, 48, 52) which are linked to many ancestors: 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55. You would then begin in SSIS with the three roots 55, 56, 57, and for 55, you would go on and merge join it to the ancestors of its descendants (32, 36, 40, 44, 48, 52). Thus, you would check the ancestor of 32 and merge join it with 55 to make 32. Once that is done, you would do the same with the ancestor of 36, and so on, until you built all of the descendants of 55.

    The recursions work in another way than I wanted but turn out to be better for the sorting and oversight. They still keep the blocks as neighbourhoods of something you would see as a family of the same specific root if you sorted by "Block". For example, 31, 32, 33, 34 stays in one family even though 32, 33, 34 are just half siblings with 31 as their parent but 55 as the parent of 32, 56 that of 33, and 57 that of 34. The good thing of keeping aside the 55 here is that the block stays one group. It also does not get outweighed by nodes that reach far too many nodes but are just one parent for one ID of the block, like the 55, the parent of 32:

    enter image description here