python pandas dataframe dictionary defaultdict

Remove nan values from a defaultdict(list) of dicts


I have the following data, produced by running some analysis, with the results stored in a defaultdict(list). Afterwards I write the results to a CSV file. First, I'd like to remove the items whose 'Check2' value is nan.

How would I remove the values inside of the list of dicts?

from numpy import nan 
from collections import defaultdict

d = defaultdict(list,
                     {'Address_1': [{'Name': 'name',
               'Address_match': 'address_match_1',
               'ID': 'id',
               'Type': 'abc',
                'Check1' : 8,
                 'Check2' : 1},
              {'Name': 'name',
               'Address_match': 'address_match_2',
               'ID': 'id',
               'Type': 'abc',
                'Check1' : 20,
                 'Check2' : nan},
              {'Name': 'name',
               'Address_match': 'address_match_3',
               'ID': 'id',
               'Type': 'abc',
                'Check1' : 27,
                 'Check2' : nan}],
              'Address_2': [{'Name': 'name',
               'Address_match': 'address_match_1',
               'ID': 'id',
               'Type': 'abc',
                'Check1' : 30,
                 'Check2' : 1},
              {'Name': 'name',
               'Address_match': 'address_match_2',
               'ID': 'id',
               'Type': 'abc',
                'Check1' : 38,
                 'Check2' : nan},
              {'Name': 'name',
               'Address_match': 'address_match_3',
               'ID': 'id',
               'Type': 'abc',
                'Check1' : 12,
                 'Check2' : nan}]})

Afterwards my results should be:

d = defaultdict(list,
                     {'Address_1': [{'Name': 'name',
               'Address_match': 'address_match_1',
               'ID': 'id',
               'Type': 'abc',
                'Check1' : 8,
                 'Check2' : 1}],
              'Address_2': [{'Name': 'name',
               'Address_match': 'address_match_1',
               'ID': 'id',
               'Type': 'abc',
                'Check1' : 30,
                 'Check2' : 1}
            ]})

Solution

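  • Try going through pandas:

    import pandas as pd

    # Series indexed by (address, position); each value is one record dict
    df = pd.DataFrame.from_records(d).unstack()
    # Keep records whose Check2 is not NaN, then rebuild {address: [records]}
    d = df[df.str['Check2'].notna()].unstack(level=0).to_dict('list')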
    print(d)
    
    # Output:
    {'Address_1': [{'Name': 'name',
       'Address_match': 'address_match_1',
       'ID': 'id',
       'Type': 'abc',
       'Check1': 8,
       'Check2': 1}],
     'Address_2': [{'Name': 'name',
       'Address_match': 'address_match_1',
       'ID': 'id',
       'Type': 'abc',
       'Check1': 30,
       'Check2': 1}]}
    

    Update

    You can simply use a dict comprehension with a nested list comprehension, which keeps the result as a mapping from address to records (a plain dict; wrap it in defaultdict(list, ...) if you need that type back):

    d = {k: [v for v in l if pd.notna(v['Check2'])] for k, l in d.items()}
    print(d)

    # Output:
    {'Address_1': [{'Name': 'name',
       'Address_match': 'address_match_1',
       'ID': 'id',
       'Type': 'abc',
       'Check1': 8,
       'Check2': 1}],
     'Address_2': [{'Name': 'name',
       'Address_match': 'address_match_1',
       'ID': 'id',
       'Type': 'abc',
       'Check1': 30,
       'Check2': 1}]}
    

    If that is hard to read, here is the same logic written with ordinary loops:

    data = defaultdict(list)
    for k, l in d.items():  # for each key in d (Address_1, Address_2, ...)
        for v in l: # for each record in key {'Name': ...}
            if pd.notna(v['Check2']):  # check the condition
                data[k].append(v)  # append to the dict
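
If you would rather not pull in pandas just for the NaN check, the loop version above can be sketched with only the standard library. This is a minimal sketch, not part of the original answer: the trimmed-down records, the `is_missing` helper, and the `Address` CSV column are illustrative assumptions, and it assumes `Check2` holds ordinary Python floats (which `numpy.nan` is). It also shows why an equality test does not work here, and one possible way to flatten the filtered result into CSV rows.

```python
import csv
import io
import math
from collections import defaultdict

nan = float("nan")  # numpy.nan is this same kind of Python float

# Small stand-in for the question's data (records trimmed for brevity).
d = defaultdict(list, {
    "Address_1": [
        {"Address_match": "address_match_1", "Check1": 8, "Check2": 1},
        {"Address_match": "address_match_2", "Check1": 20, "Check2": nan},
    ],
    "Address_2": [
        {"Address_match": "address_match_1", "Check1": 30, "Check2": 1},
        {"Address_match": "address_match_2", "Check1": 38, "Check2": nan},
    ],
})


def is_missing(value):
    """Hypothetical helper: True only for float NaN; ints and strings are kept."""
    return isinstance(value, float) and math.isnan(value)


# NaN never compares equal to itself, so an == test cannot detect it.
print(nan == nan)  # False

# Rebuild as a defaultdict(list) so the result keeps the original type.
filtered = defaultdict(list)
for key, records in d.items():
    for record in records:
        if not is_missing(record["Check2"]):
            filtered[key].append(record)

# Flatten to one CSV row per surviving record, with the key as its own column.
# io.StringIO stands in for a real file opened with open(path, "w", newline="").
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=["Address", "Address_match", "Check1", "Check2"]
)
writer.writeheader()
for key, records in filtered.items():
    for record in records:
        writer.writerow({"Address": key, **record})

print(buffer.getvalue())
```

Note that with the loop form, a key whose records are all nan never gets created in `filtered`, whereas the comprehension form keeps such a key with an empty list. Pick whichever behaviour suits the CSV step.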