python pandas dataframe dictionary dictionary-comprehension

What is the best way to remove multiple entries by position (subkeys in a nested dictionary or indices in a multidimensional pandas DataFrame)?


Let's say that I have a dictionary which has the structure:

Dictionary_ABC = {
    'ABC 1': {'Zyzz': ['Val_1', '1', ..., 'Val_N'],
              'ABC 1 X': ['Val_1', NaN, ..., 'Val_N'],
              'ABC 1 Y': ['Val_1', NaN, ..., 'Val_N'],
              'ABC 1 Z': ['Val_1', NaN, ..., 'Val_N'],
              'Zim': ['Val_1', '1', ..., 'Val_N']},
    ...,
    'ABC M': {'Zyzz': ['Val_1', '1', '2', ..., 'Val_N'],
              'ABC M X': ['Val_1', NaN, NaN, ..., 'Val_N'],
              'ABC M Y': ['Val_1', NaN, NaN, ..., 'Val_N'],
              'ABC M Z': ['Val_1', NaN, NaN, ..., 'Val_N'],
              'Zim': ['Val_1', '1', '2', ..., 'Val_N']}}

Dictionary_ABC contains keys named 'ABC 1' to 'ABC M', where M is an arbitrary number. Each key holds subkeys with arbitrary names ('Zyzz' and 'Zim') as well as three subkeys that follow the key's naming convention, such as 'ABC 1 X', 'ABC 1 Y', 'ABC 1 Z'. Each subkey holds a list of values; some entries are actual values and others are NaN. The goal is to remove the NaN entries from the ABC subkeys and to delete the entries at the same positions in 'Zim' and 'Zyzz', even if those contain values.

So, in the example above my ideal output would look like:

Dictionary_ABC = {
    'ABC 1': {'Zyzz': ['Val_1', ..., 'Val_N-1'],
              'ABC 1 X': ['Val_1', ..., 'Val_N-1'],
              'ABC 1 Y': ['Val_1', ..., 'Val_N-1'],
              'ABC 1 Z': ['Val_1', ..., 'Val_N-1'],
              'Zim': ['Val_1', ..., 'Val_N-1']},
    ...,
    'ABC M': {'Zyzz': ['Val_1', ..., 'Val_N-2'],
              'ABC M X': ['Val_1', ..., 'Val_N-2'],
              'ABC M Y': ['Val_1', ..., 'Val_N-2'],
              'ABC M Z': ['Val_1', ..., 'Val_N-2'],
              'Zim': ['Val_1', ..., 'Val_N-2']}}

For the sake of simplicity, I assumed that one entry was removed from the same position in every list under 'ABC 1', and two entries from the same positions in every list under 'ABC M'. However, any number of entries could be affected. To recap: for a given key, if any of the ABC subkeys contains NaN at some position, the entry at that position should be removed from all of that key's subkey lists.

What I attempted was to transform this nested dictionary into a pandas DataFrame to perform operations on it. The result had rows indexed by tuples and columns numbered from 1 to Ψ, Ψ being the arbitrary number of columns. Using the example above, it looked like:

('ABC 1', 'Zyzz')    ['Val_1', '1', ..., 'Val_N']
('ABC 1', 'ABC 1 X') ['Val_1', NaN, ..., 'Val_N']
('ABC 1', 'ABC 1 Y') ['Val_1', NaN, ..., 'Val_N']
('ABC 1', 'ABC 1 Z') ['Val_1', NaN, ..., 'Val_N']
('ABC 1', 'Zim')     ['Val_1', '1', ..., 'Val_N']
        . 
        .
        .
('ABC M', 'Zyzz')    ['Val_1', '1', '2',..., 'Val_N']
('ABC M', 'ABC M X') ['Val_1', NaN, NaN, ...,'Val_N']
('ABC M', 'ABC M Y') ['Val_1', NaN, NaN, ...,'Val_N']
('ABC M', 'ABC M Z') ['Val_1', NaN, NaN, ...,'Val_N']
('ABC M', 'Zim')     ['Val_1', '1', '2',..., 'Val_N']

I attempted to index NaN values using this code:

NaN_Index = []
for row in cleaning_dataframe.iterrows():
    current_index = cleaning_dataframe[row].index[cleaning_dataframe[row].apply(np.isnan)]
    NaN_Index.append(current_index)

The idea was to collect these indices in a list that could later be passed to df.drop in a loop to remove the offending entries. I received this KeyError:

KeyError: (('ABC 1', 'Zyzz'), 0 Val_1
1 Val_2
2 Val_3
  .
  .
  .
N+1 Val_N

The idea was to create a function like this:

for row in df.iterrows():
    for i in NaN_Index:
        DataFrame.drop(labels=row, axis=0, index=i, columns=None, level=None, inplace=False, errors='raise')

With that background out of the way, is this the best way to go about it? Is there some form of dictionary comprehension I could use that would not require transforming my nested dictionary into a dataframe? Is there a better way of building this dataframe so that I can sift through it and easily delete the rows or columns I want? Thank you so much for your consideration. Please feel free to ask me any questions.

Edit: I have also attempted a nested loop to see if I could find values in the same position to delete.

for k in d:
    for sk in d[k]:
        if re.findall(r"ABC \d+", sk) == re.findall(r"ABC \d+", k):
            for v in d[k][sk]:
                if all(np.isnan(d[k][sk][v])):
                    print("you've reached this point")

Unfortunately, I receive a TypeError: list indices must be integers or slices, not str.


Solution

  • One idea is to create a dataframe from a nested dictionary. For example, for the first item, it would give:

    print(pd.DataFrame(Dictionary_ABC['ABC 1']))
        Zyzz ABC 1 X ABC 1 Y ABC 1 Z    Zim
    0  Val_1   Val_1   Val_1   Val_1  Val_1
    1      1     NaN     NaN     NaN      1
    2  Val_N   Val_N   Val_N   Val_N  Val_N
    

    From this, you can use loc to select the rows you want to keep: filter the columns whose names contain 'ABC', check with isna whether all of them are NaN in a given row, and keep the rows where that is not the case.

    print(pd.DataFrame(Dictionary_ABC['ABC 1'])
            .loc[lambda x: ~x.filter(like='ABC').isna().all(axis=1)])
        Zyzz ABC 1 X ABC 1 Y ABC 1 Z    Zim
    0  Val_1   Val_1   Val_1   Val_1  Val_1
    2  Val_N   Val_N   Val_N   Val_N  Val_N
    

    You then just need to convert the result back with to_dict, using orient='list', and use a dictionary comprehension to build the new dictionary. Starting from this:

    import numpy as np
    import pandas as pd

    Dictionary_ABC = {
        'ABC 1':{'Zyzz': ['Val_1', '1',  'Val_N'], 
                 'ABC 1 X': ['Val_1', np.nan,  'Val_N'], 
                 'ABC 1 Y': ['Val_1', np.nan, 'Val_N'], 
                 'ABC 1 Z': ['Val_1', np.nan, 'Val_N'], 
                 'Zim': ['Val_1', '1', 'Val_N']},  
        'ABC M':{'Zyzz': ['Val_1', '1', '2',  'Val_N'], 
                 'ABC M X': ['Val_1', np.nan, np.nan,  'Val_N'],
                 'ABC M Y': ['Val_1', np.nan, np.nan, 'Val_N'], 
                 'ABC M Z': ['Val_1', np.nan, np.nan,  'Val_N'], 
                 'Zim': ['Val_1', '1', '2', 'Val_N']}}
    

    you can do:

    new_dict = {
        key: (pd.DataFrame(val)
                .loc[lambda x: ~x.filter(like='ABC').isna().all(axis=1)]
                .to_dict(orient='list'))
        for key, val in Dictionary_ABC.items()
    }
    new_dict
    {'ABC 1': {'Zyzz': ['Val_1', 'Val_N'],
      'ABC 1 X': ['Val_1', 'Val_N'],
      'ABC 1 Y': ['Val_1', 'Val_N'],
      'ABC 1 Z': ['Val_1', 'Val_N'],
      'Zim': ['Val_1', 'Val_N']},
     'ABC M': {'Zyzz': ['Val_1', 'Val_N'],
      'ABC M X': ['Val_1', 'Val_N'],
      'ABC M Y': ['Val_1', 'Val_N'],
      'ABC M Z': ['Val_1', 'Val_N'],
      'Zim': ['Val_1', 'Val_N']}}
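
The question also asks whether this can be done without converting to a dataframe, and its recap says a position should go if any ABC subkey is NaN there, whereas the filter above only drops a row when all ABC columns are NaN (in pandas, swapping .all(axis=1) for .any(axis=1) would give the stricter rule). A plain-Python sketch of the same idea could look like the following. The helper names is_nan and drop_nan_positions, the marker and how parameters, and the sample data are all made up for illustration, and the sketch assumes every list under a given key has the same length.

```python
import math


def is_nan(value):
    """Return True only for float NaN; strings and other objects never count."""
    return isinstance(value, float) and math.isnan(value)


def drop_nan_positions(nested, marker="ABC", how="any"):
    """Hypothetical helper: remove NaN positions from every list under each key.

    how="any" drops a position if at least one marker list is NaN there.
    how="all" drops it only if every marker list is NaN there.
    Assumes all lists under the same key have equal length.
    """
    test = any if how == "any" else all
    cleaned = {}
    for key, sub in nested.items():
        # Only the lists whose subkey contains the marker decide what is dropped.
        checked = [values for name, values in sub.items() if marker in name]
        length = len(next(iter(sub.values()), []))
        keep = [
            i for i in range(length)
            if not (checked and test(is_nan(values[i]) for values in checked))
        ]
        # Rebuild every list, marker or not, from the surviving positions.
        cleaned[key] = {
            name: [values[i] for i in keep] for name, values in sub.items()
        }
    return cleaned


# Illustrative data (not from the question): 'ABC 1 Z' is filled at position 1.
nan = float("nan")
example = {
    "ABC 1": {
        "Zyzz": ["Val_1", "1", "Val_N"],
        "ABC 1 X": ["Val_1", nan, "Val_N"],
        "ABC 1 Y": ["Val_1", nan, "Val_N"],
        "ABC 1 Z": ["Val_1", "filled", "Val_N"],
        "Zim": ["Val_1", "1", "Val_N"],
    }
}

result_any = drop_nan_positions(example, how="any")  # position 1 is removed
result_all = drop_nan_positions(example, how="all")  # position 1 is kept
```

With how="any", position 1 disappears from all five lists because 'ABC 1 X' and 'ABC 1 Y' are NaN there; with how="all" it stays, because 'ABC 1 Z' holds a value. Since np.nan is an ordinary Python float, the same check should also work on lists built with np.nan.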