Tags: python, json, parsing, marshalling

python - Efficiently parsing flattened strings into nested dictionaries


I'm obtaining a list of directories and items from a proprietary database. The lists can be enormous, containing thousands of views and a variety of nesting. Example of a list:

"MIPK",
"MIPK\/CM.toroidal",
"MIPK\/CM.Supervoid",
"MIPK\/DORAS",
"MIPK\/DORAS\/CRUDE",
"MIPK\/DORAS\/CRUDE\/CM.forest",
"MIPK\/DORAS\/CRUDE\/CM.benign",
"MIPK\/DORAS\/CRUDE\/CM.dunes",
"MIPK\/DORAS\/COMMODITIES",
"MIPK\/DORAS\/COMMODITIES\/CRUDE",
"MIPK\/DORAS\/COMMODITIES\/CRUDE\/CM.tangeant",
"MIPK\/DORAS\/COMMODITIES\/CRUDE\/CM.astral",
"MIPK\/DORAS\/COMMODITIES\/CRUDE\/CM.forking"

Directories are upper-case and separated with \/; mixed case represents items.

My currently returned JSON looks like this:

    {
    "contents": [{
        "root_path": "MIPK",
        "root_name": "MIPK",
        "directories": [{
            "subd_name": "DORAS",
            "subd_path": "MIPK.DORAS"
        }],
        "views": [{
                "view_name": "CM.toroidal"
            },
            {
                "view_name": "CM.Supervoid"
            }
        ]
    }, {
        "root_path": "MIPK.DORAS",
        "root_name": "DORAS",
        "directories": [{
                "subd_name": "CRUDE",
                "subd_path": "MIPK.DORAS.CRUDE"
            },
            {
                "subd_name": "COMMODITIES",
                "subd_path": "MIPK.DORAS.COMMODITIES"
            }
        ],
        "views": []
    }, {
        "root_path": "MIPK.DORAS.CRUDE",
        "root_name": "CRUDE",
        "directories": [],
        "views": [{
                "view_name": "CM.forest"
            },
            {
                "view_name": "CM.benign"
            },
            {
                "view_name": "CM.dunes"
            }
        ]
    }, {
        "root_path": "MIPK.DORAS.COMMODITIES",
        "root_name": "COMMODITIES",
        "directories": [{
            "subd_name": "CRUDE",
            "subd_path": "MIPK.DORAS.COMMODITIES.CRUDE"
        }],
        "views": []

    }, {
        "root_path": "MIPK.DORAS.COMMODITIES.CRUDE",
        "root_name": "CRUDE",
        "directories": [],
        "views": [{
                "view_name": "CM.tangeant"
            },
            {
                "view_name": "CM.astral"
            },
            {
                "view_name": "CM.forking"
            }
        ]
    }]

    }

Current code:

import logging
import copy
import time

def fetch_resources(input_resources_list):
    '''
    :return: JSON list of dictionaries, each dictionary element containing:
    Directory name, directory path, list of sub-directories, list of views

    input_resources_list is a flattened list produced by a database walk function
    '''

    start = time.time()
    resources = {
        'contents': [{}]
    }

    for item in input_resources_list:
        # Parsing list into usable pieces
        components = item.rsplit('\\', 1)
        # Last path component with any leading separator characters stripped
        clean_item = components[-1].lstrip('\\/')

        if len(components) == 1:
            # Handles first element
            root_dict = {'root_path': components[0],
                         'root_name': components[-1],
                         'directories': [],
                         'views': []
                         }
            resources['contents'][0].update(root_dict)
        else:
            # Enumerate resources in the list so a search by key value can be done and records can then be appended.
            for ind, content in enumerate(copy.deepcopy(resources['contents'])):

                if resources['contents'][ind]['root_path'] == components[0]:
                    # Directories are upper case; add a new directory entry
                    if clean_item.isupper():
                        root_dict = {'root_path': components[0],
                                     'root_name': components[-1],
                                     'directories': [],
                                     'views': []
                                     }
                        resources['contents'].append(root_dict)
                        sub_dict = {'subd_path': components[0],
                                    'subd_name': components[-1]}
                        resources['contents'][ind]['directories'].append(sub_dict)
                    else:
                        resources['contents'][ind]['views'] \
                            .append({'view_name':components[-1]})
    print('It took {} ms'.format((time.time() - start) * 1000))
    return resources

This works fine on small workloads (circa 100-500 items), but not at the target workloads in the thousands.

  1. How can I optimise the method for time?

  2. Currently the enumerate loop is rebuilt for every item in the input list in order to search by the root_path key values. Is there an easier way to search the list of dicts for a key value and append entries to that entry's directories and views lists? (A rough sketch of the kind of lookup I mean is below.)
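
To illustrate what I mean in (2), this is roughly the kind of keyed lookup I have in mind (untested sketch; the names and sample values are only placeholders):

# Untested sketch: keep a dict keyed on root_path so each parent entry is
# found in O(1) instead of rescanning resources['contents'] for every item.
resources = {'contents': []}
path_index = {}  # root_path -> the same dict object stored in resources['contents']

root_dict = {'root_path': 'MIPK.DORAS.CRUDE', 'root_name': 'CRUDE',
             'directories': [], 'views': []}
resources['contents'].append(root_dict)
path_index[root_dict['root_path']] = root_dict

# Later, instead of the enumerate loop:
parent = path_index.get('MIPK.DORAS.CRUDE')
if parent is not None:
    parent['views'].append({'view_name': 'CM.forest'})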


Solution

  • You’ve thrown away a lot of information in building these strings. For example, when you see MIPK\/DORAS, there’s no way to know whether it’s a file (which should just be a string in the parent directory’s list), a leaf directory (which should be a list in the parent directory’s dict), or an intermediate directory (which should be a dict in the parent directory’s dict). The best you can do (without a quadratic nested search) is guess, and then change it later if you guessed wrong.
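
    For illustration, the three readings of that single entry would look like this in the nested shape we’re aiming for (the variable names are just labels):

    # Three equally valid readings of the single entry "MIPK\/DORAS":
    as_file = {'MIPK': ['DORAS']}              # DORAS is a file inside MIPK
    as_leaf_dir = {'MIPK': {'DORAS': []}}      # DORAS is a leaf directory
    as_intermediate = {'MIPK': {'DORAS': {}}}  # DORAS is an intermediate directory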

    Also, sometimes your intermediate directories don’t show up on their own and only appear as components of later paths, while other times they do appear. So sometimes the “guess” you have to fix is that the directory doesn’t exist at all yet.

    And finally, your desired format is ambiguous if, e.g., any directory can contain both files and directories (should it be a dict or a list?), or nothing at all (should it be an empty dict, an empty list, or just a string?).

    However, things seem to be in sorted order, and the ambiguous cases don’t seem to actually show up. If we can rely on both of these, we can tell whether something is a directory just by seeing whether it’s a prefix of the next entry. Then we can just read the leaves and put them into a collection of strings inside a list inside 0 or more nested dicts.

    So, first, we want to iterate adjacent pairs of paths, to make it easier to do that “is it a prefix of the next value?” check:

    import itertools

    output = {}

    # Pair each path with the one that follows it; the last path pairs with ''.
    it1, it2 = itertools.tee(paths)
    next(it2)
    pairs = itertools.zip_longest(it1, it2, fillvalue='')
    for path, nextpath in pairs:
        if nextpath.startswith(path):
            # path is a prefix of the next entry, so it's a directory, not a leaf
            continue
    
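    Just to make that pairing concrete, here’s what the loop sees on a small, hypothetical slice of the sample list (assuming the paths are plain Python strings containing the literal \/ separator):

    import itertools

    sample_paths = [r"MIPK\/DORAS", r"MIPK\/DORAS\/CRUDE", r"MIPK\/DORAS\/CRUDE\/CM.forest"]
    it1, it2 = itertools.tee(sample_paths)
    next(it2)
    for path, nextpath in itertools.zip_longest(it1, it2, fillvalue=''):
        print(path, '->', 'leaf' if not nextpath.startswith(path) else 'directory, skipped')
    # MIPK\/DORAS -> directory, skipped
    # MIPK\/DORAS\/CRUDE -> directory, skipped
    # MIPK\/DORAS\/CRUDE\/CM.forest -> leaf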

    Now, we know path is a leaf, so we need to find the list to append it to, creating one if needed, which may mean recursively creating dicts along the way:

    # All but the last component are directory names; the final component is the leaf name
    components = path.split(r'\/')
    d = output
    for component in components[:-2]:
        d = d.setdefault(component, {})
    d.setdefault(components[-2], []).append(components[-1])
    

    I haven’t tested this, but it should do the right thing for non-ambiguous input, and raise some kind of exception if any directory includes both files and subdirs, or any top-level directory includes files, or in any of the other ambiguous cases (except for empty directories, which will be treated the same as files).

    Sure, this is kind of ugly, but that’s inherent in parsing an ugly format that relies on lots of special-case rules to handle what would otherwise be ambiguous.
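
    For reference, here’s a minimal, untested sketch that puts the two pieces together into one function (build_tree is just an illustrative name), run on a non-ambiguous subset of the sample data:

    import itertools

    def build_tree(paths):
        output = {}
        it1, it2 = itertools.tee(paths)
        next(it2)  # advance the second iterator so each path is paired with the next one
        for path, nextpath in itertools.zip_longest(it1, it2, fillvalue=''):
            if nextpath.startswith(path):
                continue  # a directory that reappears as a prefix below; skip it
            components = path.split(r'\/')
            d = output
            for component in components[:-2]:
                d = d.setdefault(component, {})
            d.setdefault(components[-2], []).append(components[-1])
        return output

    subset = [
        r"MIPK\/DORAS",
        r"MIPK\/DORAS\/CRUDE",
        r"MIPK\/DORAS\/CRUDE\/CM.forest",
        r"MIPK\/DORAS\/CRUDE\/CM.benign",
        r"MIPK\/DORAS\/COMMODITIES",
        r"MIPK\/DORAS\/COMMODITIES\/CRUDE",
        r"MIPK\/DORAS\/COMMODITIES\/CRUDE\/CM.tangeant",
    ]
    print(build_tree(subset))
    # {'MIPK': {'DORAS': {'CRUDE': ['CM.forest', 'CM.benign'], 'COMMODITIES': {'CRUDE': ['CM.tangeant']}}}}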