Convert the string representation of a pandas MultiIndex back into a MultiIndex in Python


I have the string representation of a MultiIndex, created as shown below.

import pandas as pd

iterables = [['bar', 'baz', 'foo', 'qux'], ['one', 'two']]
df = pd.MultiIndex.from_product(iterables, names=['first', 'second'])
df = str(df)

I would like to convert the string df back into a pandas MultiIndex object. Does pandas provide a function that does this directly?

Expected output:

print(df)
MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
       labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
       names=['first', 'second'])

Thanks in advance.


Solution

  • The string representation of the MultiIndex is nearly executable code, so you could evaluate it with eval, like this:

    eval(df, {}, {'MultiIndex': pd.MultiIndex})
    # MultiIndex(levels=[[u'bar', u'baz', u'foo', u'qux'], [u'one', u'two']],
    #        labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
    #        names=[u'first', u'second'])
    

    Just be sure that you control the string you pass to eval, since a malicious string could crash your computer or run arbitrary code.

    Alternatively, here's a safe and simple but somewhat brittle way to do this:

    import ast
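    # convert df into a literal string defining a dictionary
    # (df[11:-1] strips the leading "MultiIndex(" and the trailing ")")
    dfd = (
        ("{" + df[11:-1] + "}")  # parenthesized so .replace() applies to the whole string
            .replace("levels=", "'levels':")
            .replace("labels=", "'labels':")
            .replace("names=", "'names':")
    )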
    # convert it safely into an actual dictionary
    args = ast.literal_eval(dfd)
    # use the dictionary as arguments to pd.MultiIndex
    pd.MultiIndex(**args)
    

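    With this code, an arbitrary string cannot run code on your computer, since ast.literal_eval() accepts only literal expressions (strings, numbers, tuples, lists, dicts, sets, booleans and None) and rejects function calls and names. An extremely large or deeply nested string can still exhaust memory or the interpreter's stack, so it is not a defense against every bad input.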
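
    As a small illustration of that point, the sketch below feeds ast.literal_eval() one plain literal and one string containing a function call. The variable names (parsed, malicious, rejected) are made up for this example; only ast.literal_eval() itself comes from the answer above.

```python
import ast

# A pure literal is parsed into the corresponding Python object.
parsed = ast.literal_eval("{'names': ['first', 'second']}")
print(parsed)  # {'names': ['first', 'second']}

# A string containing a function call is not a literal, so it is
# rejected with ValueError instead of being executed.
malicious = "__import__('os').system('echo unsafe')"
try:
    ast.literal_eval(malicious)
    rejected = False
except ValueError:
    rejected = True

print(rejected)
```

    The call inside the second string never runs; literal_eval() refuses it while walking the parsed tree.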

    Here's a version that's safe and doesn't require pre-specifying the argument names, but it's more complex:

    import ast, tokenize
    from io import StringIO  # Python 3 replacement for cStringIO
    tokens = [  # make a list of mutable tokens
        list(t)
        for t in tokenize.generate_tokens(StringIO('{' + df[11:-1] + '}').readline)
    ]
    for t, next_t in zip(tokens[:-1], tokens[1:]):
        # convert `identifier=` to `'identifier':`
        # (named constants are used because the numeric token codes differ between Python versions)
        if t[0] == tokenize.NAME and next_t[0] == tokenize.OP and next_t[1] == '=':
            t[0] = tokenize.STRING    # switch type to quoted string
            t[1] = "'" + t[1] + "'"   # put quotes around identifier
            next_t[1] = ':'           # convert '=' to ':'
    args = ast.literal_eval(tokenize.untokenize(tokens))
    pd.MultiIndex(**args)
    

    Note that this code will raise an exception if df is malformed or contains 'identifier=...' as code (not inside strings) at nested levels, but I don't think that can happen with str(MultiIndex). If it is an issue, you could generate an ast tree for the original df string, extract the keyword arguments, and convert those programmatically into a literal definition for a dict ({x: y}, not dict(x=y)), then use ast.literal_eval to evaluate that.
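
    A minimal sketch of that ast-based idea follows, under a few assumptions: the string is a single call written with keyword arguments only, as in the output shown in the question, and every argument value is a plain literal. The helper name parse_call_kwargs and its expected_name parameter are hypothetical, not part of pandas or the standard library. Instead of rebuilding a dict literal as text, it passes each keyword's value node straight to ast.literal_eval(), which also accepts AST nodes.

```python
import ast

def parse_call_kwargs(text, expected_name="MultiIndex"):
    """Return the keyword arguments of a repr-like call string as a dict.

    Hypothetical helper: nothing in the string is executed; each value
    must be a plain literal or ast.literal_eval raises ValueError.
    """
    tree = ast.parse(text.strip(), mode="eval")
    call = tree.body
    if not isinstance(call, ast.Call):
        raise ValueError("expected a single call expression")
    if not (isinstance(call.func, ast.Name) and call.func.id == expected_name):
        raise ValueError("unexpected callable name")
    if call.args:
        raise ValueError("positional arguments are not supported")
    kwargs = {}
    for kw in call.keywords:
        if kw.arg is None:  # a `**mapping` argument has no name
            raise ValueError("'**' unpacking is not supported")
        # literal_eval accepts an AST node as well as a string
        kwargs[kw.arg] = ast.literal_eval(kw.value)
    return kwargs

# Sample input copied from the output shown in the question.
text = """MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
       labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
       names=['first', 'second'])"""
args = parse_call_kwargs(text)
print(sorted(args))
```

    The resulting dict could then be passed on as pd.MultiIndex(**args), as in the earlier versions. That last step assumes a pandas version whose constructor accepts these keyword names; newer releases appear to call the second argument codes rather than labels, in which case the key would need renaming first.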