Search code examples
pandasdataframeuniquemulti-indexhierarchical

How to create a multiIndex (hierarchical index) dataframe object from another df's column's unique values?


I'm trying to create a pandas multiIndexed dataframe that is a summary of the unique values in each column.

Is there an easier way to have this information summarized besides creating this dataframe?

Either way, it would be nice to know how to complete this code challenge. Thanks for your help! Here is the toy dataframe and the solution I attempted using a for loop with a dictionary and a value_counts dataframe. Not sure if it's possible to incorporate MultiIndex.from_frame or .from_product here somehow...

Original Dataframe:

data = pd.DataFrame({'A': ['case', 'case', 'case', 'case', 'case'], 
                     'B': [2001, 2002, 2003, 2004, 2005], 
                     'C': ['F', 'M', 'F', 'F', 'M'],
                     'D': [0, 0, 0, 1, 0],
                     'E': [1, 0, 1, 0, 1],
                     'F': [1, 1, 0, 0, 0]})


    A       B       C   D   E   F
0   case    2001    F   0   1   1
1   case    2002    M   0   0   1
2   case    2003    F   0   1   0
3   case    2004    F   1   0   0
4   case    2005    M   1   1   0

Desired outcome:

     unique  percent
A    case    100 
B    2001    20
     2002    20
     2003    20
     2004    20
     2005    20
C    F       60
     M       40
D    0       80
     1       20
E    0       40
     1       60
F    0       60
     1       40

My failed for loop attempt:

def unique_values(df):
    values = {}
    columns = []
    df = pd.DataFrame(values, columns=columns)
    for col in data:
        df2 = data[col].value_counts(normalize=True)*100
        values = values.update(df2.to_dict)
        columns = columns.append(col*len(df2))
    return df

unique_values(data)


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-84-a341284fb859> in <module>
     11 
     12 
---> 13 unique_values(data)

<ipython-input-84-a341284fb859> in unique_values(df)
      5     for col in data:
      6         df2 = data[col].value_counts(normalize=True)*100
----> 7         values = values.update(df2.to_dict)
      8         columns = columns.append(col*len(df2))
      9     return df

TypeError: 'method' object is not iterable

Let me know if there's something obvious I'm missing! Still relatively new to EDA and pandas, any pointers appreciated.


Solution

  • This is a fairly straightforward application of .melt:

    data.melt().reset_index().groupby(['variable', 'value']).count()/len(data)
    

    output

                    index
    variable value  
    A        case   1.0
    B        2001   0.2
             2002   0.2
             2003   0.2
             2004   0.2
             2005   0.2
    C        F      0.6
             M      0.4
    D        0      0.8
             1      0.2
    E        0      0.4
             1      0.6
    F        0      0.6
             1      0.4