Search code examples
pythonpandascrosstab

How to convert a dict into a pandas multiindex dataframe


I´m handling distances between elements (e.g. a1,a2, ...) of different attributes (e.g. A, B, ...) and I choose a dict to store the distances. The dict has the following form:

mydict = {('A', ('a1','a2')): 1.0,
          ('A', ('a1','a3')): 0.5,
          ('A', ('a2','a1')): 1.1,
          ('A', ('a2','a3')): 0.8,
          ('A', ('a3','a1')): 1.2,
          ('A', ('a3','a2')): 1.2,
          ('B', ('b1','b2')): 1.0,
          ('B', ('b1','b3')): 0.5,
          ('B', ('b2','b1')): 1.1,
          ('B', ('b2','b3')): 0.8,
          ('B', ('b3','b1')): 1.2,
          ('B', ('b3','b2')): 1.2,
          }

So the keys of the dict are tuples with a first element giving the attribute and the second element beeing a tuple itself giving the two elements, which distance is given in the corresponding value.

Now I want to display the data in form of crosstables which should look somewhat like this:

A   a1   a2   a3
a1    0  1.0   0.5
a2  1.1    0   0.8
a3  1.2  1.2    0

B   b1   b2   b3
b1    0  1.0   0.5
b2  1.1    0   0.8
b3  1.2  1.2    0

and so on for each attribut.

I tried to to convert the data to a DataFrame in oder to maybe use the casstab function of pandas. I tried to convert the keys of the dict to a list and use pandas.MultiIndex.from_tuples and then MultiIndex.to_frame but I did´t get a usable format.

Any suggestions how to deal with this or to store the distance data differently to begin with?


Solution

  • I think the format of the data is fine. You just need to unpack it correctly to get a useable dataframe.

    • Construct a dataframe from the dict values and dict keys.
    • Unpack the level of the index with the element names to two columns.
    • Apply crosstab with the appropriate index and columns (apply .fillna(0) to the result to get the exact same structure as in your question).
    df = pd.DataFrame(my_dict.values(), pd.MultiIndex.from_tuples(my_dict.keys()))
    df[['first_element', 'second_element']] = df.index.get_level_values(1).to_list()
    pd.crosstab(
                [df.index.get_level_values(0), df.first_element],
                df.second_element,
                values=df[0],
                aggfunc='sum'
                )