python pandas tuples

Is there a way to have tuples working fine as an index in Pandas?


I would like to use a MultiIndex in Pandas where every level holds a nested tuple. I know I could in principle unpack the tuples into separate levels, but that would be less legible and more annoying to work with. The elements of each tuple (a class name and some parameters) only have meaning together, I would like to make it harder to end up with nonsensical pairs, the tuples have different lengths, and I'd like to use MultiIndex.from_product.

Everything works fine when creating the DataFrame and accessing values, but when writing I get results I wasn't expecting.

In a simple example, the following code:

import pandas as pd
index=pd.MultiIndex.from_arrays([[("foo","spam"),("foo","spam")],[("bar","egg"),("bar","egg")],[("baz","bacon"),("pam","bacon")]])
this_index = (("foo","spam"),("bar","egg"),("baz","bacon"))
df = pd.DataFrame(index=index, columns=["value"])
print(df)
print(df.loc[this_index])
df.loc[this_index]=0
# df.loc[this_index,"value"]=0
print(df)

It first prints the table I expected (three tuples as the index and NaN in the column value), then prints the correctly retrieved value NaN, but the last print shows two extra columns named "bar" and "egg", both set to 0 in the selected row:

                                    value  bar  egg
(foo, spam) (bar, egg) (baz, bacon)     0  0.0  0.0
                       (pam, bacon)   NaN  NaN  NaN

In this case, using the commented line for the assignment gives the expected result.

However, in my case, I need "spam", "egg", and "bacon" to be tuples as well. If I change lines 2 and 3 of the code above to:

index=pd.MultiIndex.from_arrays([[("foo",("spam",)),("foo",("spam",))],[("bar",("egg",)),("bar",("egg",))],[("baz",("bacon",)),("pam",("bacon",))]])
this_index = (("foo",("spam",)),("bar",("egg",)),("baz",("bacon",)))

The first two prints again behave as expected, and the third gives (by now somewhat expected):

                                             value  bar  (egg,)
(foo, (spam,)) (bar, (egg,)) (baz, (bacon,))     0  0.0     0.0
                             (pam, (bacon,))   NaN  NaN     NaN

But trying the same workaround as above gives:

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 2 dimensions. The detected shape was (3, 2) + inhomogeneous part.

And I couldn't find any way to adapt the trick.

The best workaround I have found so far is to call str() on the tuples and parse the content back when needed, but I feel like there should be a better way. The only trace I found here is an unanswered comment to this answer.
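For reference, the str() workaround mentioned above could look roughly like the sketch below. The helpers `encode` and `decode` are hypothetical names, not part of pandas, and the round trip through `ast.literal_eval` assumes the tuples contain only Python literals (strings, numbers, nested tuples), which may not hold if the parameters are arbitrary objects.

```python
import ast

import pandas as pd


def encode(label):
    """Hypothetical helper: turn a nested tuple into a plain string label."""
    return repr(label)


def decode(text):
    """Hypothetical helper: parse a string label back into a nested tuple."""
    return ast.literal_eval(text)


# Nested tuples per level, as in the question.
levels = [
    [("foo", ("spam",))],
    [("bar", ("egg",))],
    [("baz", ("bacon",)), ("pam", ("bacon",))],
]

# The index only ever sees strings, so .loc cannot confuse a label with a key tuple.
index = pd.MultiIndex.from_product([[encode(t) for t in level] for level in levels])
df = pd.DataFrame(index=index, columns=["value"])

this_index = (("foo", ("spam",)), ("bar", ("egg",)), ("baz", ("bacon",)))
key = tuple(encode(t) for t in this_index)
df.loc[key, "value"] = 0

# Recover the original nested tuples when they are needed again.
decoded_rows = [tuple(decode(s) for s in row) for row in df.index]
print(df)
print(decoded_rows[0] == this_index)
```

The cost is the one described in the question: every lookup has to go through the encoding step, and the labels are no longer tuples while they live in the index.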


Solution

  • If I understand correctly, your issue is with this assignment:

    index=pd.MultiIndex.from_arrays([[("foo",("spam",)),("foo",("spam",))],[("bar",("egg",)),("bar",("egg",))],[("baz",("bacon",)),("pam",("bacon",))]])
    this_index = (("foo",("spam",)),("bar",("egg",)),("baz",("bacon",)))
    
    df = pd.DataFrame(index=index, columns=["value"])
    df.loc[this_index, 'value']=0
    

    You can solve this by passing a list for the columns or for the index:

    df.loc[this_index, ['value']] = 0
    
    # or
    df.loc[[this_index], 'value'] = 0
    

    Output:

                                                 value
    (foo, (spam,)) (bar, (egg,)) (baz, (bacon,))     0
                                 (pam, (bacon,))   NaN
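As a self-contained check, the column-list variant can be run end to end as in the sketch below. The likely reason it helps is that `.loc` reads a tuple key as (row key, column key), so wrapping "value" in a list marks the second element unambiguously as the column selection. That is an interpretation of the observed output, not something the pandas documentation states in these terms, and the behaviour may vary between pandas versions.

```python
import pandas as pd

# Nested-tuple labels, as in the question.
index = pd.MultiIndex.from_arrays(
    [
        [("foo", ("spam",)), ("foo", ("spam",))],
        [("bar", ("egg",)), ("bar", ("egg",))],
        [("baz", ("bacon",)), ("pam", ("bacon",))],
    ]
)
this_index = (("foo", ("spam",)), ("bar", ("egg",)), ("baz", ("bacon",)))

df = pd.DataFrame(index=index, columns=["value"])

# Row key and column key are passed separately; the list around "value"
# makes the second element an explicit column selection.
df.loc[this_index, ["value"]] = 0

# Sanity checks: no extra columns appeared and only the first row changed.
columns_after = list(df.columns)
first_value = df["value"].iloc[0]
second_is_missing = pd.isna(df["value"].iloc[1])
print(columns_after, first_value, second_is_missing)
```

Checking the column list after the assignment is a cheap guard against the silent column enlargement shown in the question.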