python dictionary nested defaultdict

Converting an unstructured list of names and data to nested dictionary


I have an "unstructured" list that looks like this:

info = [
    'Joe Schmoe',
    'W / M / 64',
    'Richard Johnson',
    'OFFICER',
    'W / M /48',
    'Adrian Stevens',
    '? / ? / 27'
    ]

It is unstructured in that the list is a flat sequence of either:

  • (Name, Officer Status, Demographic Info) triplets, or
  • (Name, Demographic Info) pairs.

In the latter case, Officer=False, and in the former, Officer=True. The Demographic Info strings represent Race / Gender / Age, with missing values (NaNs) represented by literal question marks. Here is the result I'd like to get to:

res = {
    'Joe Schmoe': {
        'race': 'W',
        'gender': 'M',
        'age': 64,
        'officer': False
        },
    'Richard Johnson': {
        'race': 'W',
        'gender': 'M',
        'age': 48,
        'officer': True
        },
    'Adrian Stevens': {
        'race': 'NaN',
        'gender': 'NaN',
        'age': 27,
        'officer': False
        }
    }

Right now I've built two functions to do this. The first is below and handles the Demographic Info strings. (I'm fine with this one; just putting it here for reference.)

import re

def fix_demographic(info):
    # W / M / ?? --> W / M / NaN
    # ?/M/?  --> NaN / M / NaN
    # Keep as str NaN rather than np.nan for now
    race, gender, age = re.split(r'\s*/\s*', re.sub(r'\?+', 'NaN', info))
    return race, gender, age

The second function deconstructs the list and throws its values into different places in a dictionary result:

from collections import defaultdict

demographic = re.compile(r'(\w+|\?+)\s*\/\s*(\w+|\?+)\s*\/\s*(\w+|\?+)')


def parse_victim_info(info: list):
    res = defaultdict(dict)
    for i in info:
        if not demographic.fullmatch(i) and i.lower() != 'officer':
            # We have a name
            previous = 'name'
            name = i
        if i.lower() == 'officer':
            res[name]['officer'] = True
            previous = 'officer'
        if demographic.fullmatch(i):
            # We have demographic info; did "OFFICER" come before it?
            if previous == 'name':
                res[name]['officer'] = False
            race, gender, age = fix_demographic(i)
            res[name]['race'] = race
            res[name]['gender'] = gender
            res[name]['age'] = int(age) if age.isnumeric() else age
            previous = None
    return res

>>> parse_victim_info(info)
defaultdict(dict,
            {'Adrian Stevens': {'age': 27,
              'gender': 'NaN',
              'officer': False,
              'race': 'NaN'},
             'Richard Johnson': {'age': 48,
              'gender': 'M',
              'officer': True,
              # ... ...

This second function feels way too verbose & tedious for what it's doing.

Is there a better way to do this, one that keeps track more cleanly of the category of the last value seen during the iteration?


Solution

  • This sort of thing lends itself very nicely to a generator:

    Code:

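    def find_triplets(data):
        data = iter(data)
        # Iterate with `for` so the generator ends cleanly when the input runs out;
        # since Python 3.7 (PEP 479), a StopIteration escaping from a bare next()
        # at the top of a `while True` loop becomes a RuntimeError.
        for name in data:
            demo = next(data)
            officer = demo == 'OFFICER'
            if officer:
                # 'OFFICER' sits between the name and the demographic string
                demo = next(data)
            yield name, officer, demo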
    

    Test Code:

    info = [
        'Joe Schmoe',
        'W / M / 64',
        'Lillian Schmoe',
        'W / F / 60',
        'Richard Johnson',
        'OFFICER',
        'W / M /48',
        'Adrian Stevens',
        '? / ? / 27'
    ]
    
    for x in find_triplets(info):
        print(x)
    

    Results:

    ('Joe Schmoe', False, 'W / M / 64')
    ('Lillian Schmoe', False, 'W / F / 60')
    ('Richard Johnson', True, 'W / M /48')
    ('Adrian Stevens', False, '? / ? / 27')
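
The generator assumes every name is followed by its demographic string. If the list might be cut short (for example, ending right after 'OFFICER'), one option is to pass a default to `next()` and stop on an incomplete record. This is only a sketch under that assumption: `find_records` is a hypothetical name, and silently dropping the incomplete tail is a design choice, not something the original answer specifies.

```python
def find_records(data):
    """Yield (name, officer, demo) tuples; stop quietly on a truncated tail."""
    items = iter(data)
    for name in items:
        demo = next(items, None)      # None means the list ended early
        officer = demo == 'OFFICER'
        if officer:
            demo = next(items, None)  # demographic string follows 'OFFICER'
        if demo is None:
            return                    # incomplete record: drop it (assumption)
        yield name, officer, demo


# Hypothetical input that ends in the middle of a record
truncated = ['Joe Schmoe', 'W / M / 64', 'Richard Johnson', 'OFFICER']
records = list(find_records(truncated))
print(records)
```

Raising an explicit `ValueError` instead of returning may be preferable when a truncated list indicates a scraping problem.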
    

    Converting the triplets to a dict:

    import re
    
    def fix_demographic(info):
        # W / M / ?? --> W / M / NaN
        # ?/M/?  --> NaN / M / NaN
        # Keep as str NaN rather than np.nan for now
        race, gender, age = re.split(r'\s*/\s*', re.sub(r'\?+', 'NaN', info))
        return dict(race=race, gender=gender, age=age)
    
    
    data_dict = {name: dict(officer=officer, **fix_demographic(demo))
                 for name, officer, demo in find_triplets(info)}
    
    print(data_dict)
    

    Results:

    {
        'Joe Schmoe': {'officer': False, 'race': 'W', 'gender': 'M', 'age': '64'}, 
        'Lillian Schmoe': {'officer': False, 'race': 'W', 'gender': 'F', 'age': '60'}, 
        'Richard Johnson': {'officer': True, 'race': 'W', 'gender': 'M', 'age': '48'}, 
        'Adrian Stevens': {'officer': False, 'race': 'NaN', 'gender': 'NaN', 'age': '27'}
    }
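
The results above keep `age` as a string, whereas the target in the question has integer ages. Below is a minimal sketch of one way to close that gap, assuming a non-numeric age should stay as the string 'NaN', as in the question. `parse_demographic` is a hypothetical helper name, and `find_triplets` is restated with a `for` loop so the block runs on its own.

```python
import re


def find_triplets(data):
    data = iter(data)
    for name in data:
        demo = next(data)
        officer = demo == 'OFFICER'
        if officer:
            demo = next(data)
        yield name, officer, demo


def parse_demographic(text):
    # Hypothetical helper: same splitting as fix_demographic, plus int ages
    race, gender, age = re.split(r'\s*/\s*', re.sub(r'\?+', 'NaN', text.strip()))
    # Assumption: leave 'NaN' (or anything non-numeric) as a string
    return {'race': race, 'gender': gender,
            'age': int(age) if age.isdecimal() else age}


info = [
    'Joe Schmoe',
    'W / M / 64',
    'Richard Johnson',
    'OFFICER',
    'W / M /48',
    'Adrian Stevens',
    '? / ? / 27',
]

res = {name: {**parse_demographic(demo), 'officer': officer}
       for name, officer, demo in find_triplets(info)}
print(res)
```

This yields a plain `dict` rather than a `defaultdict`, which matches the `res` literal the question asks for.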