Search code examples
pythonpandasnumpydictionary

Python Categorizing Dataframe columns based on part of the column name


Reproducible data:

import random
    data = {'test_1_a':random.sample(range(1, 50), 7),
            'test_1_b':random.sample(range(1, 50), 7),
    'test_1_c':random.sample(range(1, 50), 7),
    'test_2_a':random.sample(range(1, 50), 7),
    'test_2_b':random.sample(range(1, 50), 7),
    'test_2_c':random.sample(range(1, 50), 7),
    'test_3_a':random.sample(range(1, 50), 7),
    'test_4_b':random.sample(range(1, 50), 7),
    'test_4_c':random.sample(range(1, 50), 7)}
    df = pd.DataFrame(data)

Description

I have a data frame similar to the example I gave above with 1000ish columns. The column name format is as following:

test_number_family so test_1_c would be a number type of 1 and the family of "c"

I want to classify the df by column names of the same "family" type. So my final output needs to be a list of lists of same family values:

Output example:

[ [a_familily values], [b_familily values],...]

it would also look like the values of columns:

[ [test_1_a, test_2_a , test_3_a ] , [test_1_b, test_2_b , test_3_b ] , ...]

What I have:

#### transfers data frame into a sorted dict (by column name) by columns as key
col_names = [ i for (i,j) in df.iteritems() ]
col_vals = [ j for (i,j) in df.iteritems() ]

df_dict = dict(zip(col_names, col_vals))


families = np.unique([ i.split("_")[2] for i in dict1.keys() ])

I have classified each column name with its associated value and extracted the unique number of groups I want to have in the final output as "families". I now am seeking help in categorizing the data frame into a length(families) number of lists identical to the output example I have given above.

I hope my explanation has been clear, thank you for your time!


Solution

  • Let's keep track of the different families in a dictionary, the keys being the letters (the families) and the values being lists holding the columns from a certain family.

    Since we know that each column ends with a letter related with its family, we can use that as a key in the dictionary.

    from collections import defaultdict
    families = defaultdict(list)
    
    for col in df.columns:
        families[col[-1]].append(df[col])
    

    Now for example, in families["a"], we have:

    [0    26
     1    13
     2    11
     3    35
     4    43
     5    45
     6    46
     Name: test_1_a, dtype: int64,
     0    10
     1    15
     2    20
     3    43
     4    40
     5    35
     6    22
     Name: test_2_a, dtype: int64,
     0    35
     1    48
     2    38
     3    13
     4     3
     5    10
     6    25
     Name: test_3_a, dtype: int64]
    

    We can easily get a per-family dataframe with concat.

    df_a = pd.concat(families["a"], axis=1)
    

    Gets us:

       test_1_a  test_2_a  test_3_a
    0        26        10        35
    1        13        15        48
    2        11        20        38
    3        35        43        13
    4        43        40         3
    5        45        35        10
    6        46        22        25
    

    If we were to create a dictionary of dataframes per family,

    dfs = {f"df_{fam}" : pd.concat(families[fam], axis=1) for fam in families.keys()}
    

    Now, the dictionary dfs contains:

    {'df_a':    test_1_a  test_2_a  test_3_a
     0        26        10        35
     1        13        15        48
     2        11        20        38
     3        35        43        13
     4        43        40         3
     5        45        35        10
     6        46        22        25,
     'df_b':    test_1_b  test_2_b  test_4_b
     0        18         4        44
     1        48        43         2
     2        30        21         4
     3        46        12        16
     4        42        14        25
     5        22        24        13
     6        43        40        43,
     'df_c':    test_1_c  test_2_c  test_4_c
     0        25        15         5
     1        36        39        28
     2         6         3        37
     3        22        48        16
     4         2        34        25
     5        39        16        30
     6        32        36         2}