Python/pandas: data frame from series of dict: optimization


I have a pandas Series of dictionaries, and I want to convert it to a DataFrame with the same index.

The only way I found is to go through the Series' to_dict method, which is not very efficient because it drops back to pure Python instead of staying in numpy/pandas/Cython.

Do you have suggestions for a better approach?

Thanks a lot.

>>> import pandas as pd
>>> flagInfoSeries = pd.Series(({'a': 1, 'b': 2}, {'a': 10, 'b': 20}))
>>> flagInfoSeries
0      {'a': 1, 'b': 2}
1    {'a': 10, 'b': 20}
dtype: object
>>> pd.DataFrame(flagInfoSeries.to_dict()).T
    a   b
0   1   2
1  10  20

Solution

  • I think you can use a list comprehension:

    import pandas as pd
    
    flagInfoSeries = pd.Series(({'a': 1, 'b': 2}, {'a': 10, 'b': 20}))
    print(flagInfoSeries)
    0      {'a': 1, 'b': 2}
    1    {'a': 10, 'b': 20}
    dtype: object
    
    print(pd.DataFrame(flagInfoSeries.to_dict()).T)
        a   b
    0   1   2
    1  10  20
    
    print(pd.DataFrame([x for x in flagInfoSeries]))
        a   b
    0   1   2
    1  10  20
    

    Timing:

    In [203]: %timeit pd.DataFrame(flagInfoSeries.to_dict()).T
    The slowest run took 4.46 times longer than the fastest. This could mean that an intermediate result is being cached 
    1000 loops, best of 3: 554 µs per loop
    
    In [204]: %timeit pd.DataFrame([x for x in flagInfoSeries])
    The slowest run took 5.11 times longer than the fastest. This could mean that an intermediate result is being cached 
    1000 loops, best of 3: 361 µs per loop
    
    In [209]: %timeit flagInfoSeries.apply(lambda dict: pd.Series(dict))
    The slowest run took 4.76 times longer than the fastest. This could mean that an intermediate result is being cached 
    1000 loops, best of 3: 751 µs per loop
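
The three timings above can be reproduced outside IPython with the standard timeit module. This is only a rough sketch: the 1,000-row Series `records` and the helper names `via_to_dict`, `via_comprehension` and `via_apply` are made up for illustration, and the numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd

# Hypothetical larger input: 1,000 small dicts instead of the two-row example.
records = pd.Series([{'a': i, 'b': 2 * i} for i in range(1000)])


def via_to_dict(s):
    # Approach from the question: dict of dicts, then transpose.
    return pd.DataFrame(s.to_dict()).T


def via_comprehension(s):
    # List comprehension over the Series values.
    return pd.DataFrame([x for x in s], index=s.index)


def via_apply(s):
    # Build one Series per row (the slowest variant in the timings above).
    return s.apply(pd.Series)


for func in (via_to_dict, via_comprehension, via_apply):
    seconds = timeit.timeit(lambda: func(records), number=5)
    print('{}: {:.4f} s for 5 runs'.format(func.__name__, seconds))
```

On an input this small the constant overhead of the DataFrame constructor dominates, so it is worth timing with data close to the real size before choosing.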
    

    EDIT:

    If you need to keep the index, pass index=flagInfoSeries.index to the DataFrame constructor:

    print(pd.DataFrame([x for x in flagInfoSeries], index=flagInfoSeries.index))
    

    Timings:

    In [257]: %timeit pd.DataFrame([x for x in flagInfoSeries], index=flagInfoSeries.index)
    1000 loops, best of 3: 350 µs per loop
    

    Sample:

    import pandas as pd
    
    flagInfoSeries = pd.Series(({'a': 1, 'b': 2}, {'a': 10, 'b': 20}))
    flagInfoSeries.index = [2,8]
    print(flagInfoSeries)
    2      {'a': 1, 'b': 2}
    8    {'a': 10, 'b': 20}
    dtype: object
    
    print(pd.DataFrame(flagInfoSeries.to_dict()).T)
        a   b
    2   1   2
    8  10  20
    
    print(pd.DataFrame([x for x in flagInfoSeries], index=flagInfoSeries.index))
        a   b
    2   1   2
    8  10  20
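
To see why the index= argument matters, here is a minimal Python 3 sketch based on the sample above. It reuses flagInfoSeries from the answer; the names `without_index` and `with_index` are only illustrative.

```python
import pandas as pd

flagInfoSeries = pd.Series(({'a': 1, 'b': 2}, {'a': 10, 'b': 20}))
flagInfoSeries.index = [2, 8]

# Without index=, the constructor numbers the rows from 0 again.
without_index = pd.DataFrame([x for x in flagInfoSeries])
print(without_index.index.tolist())  # [0, 1]

# With index=, the labels 2 and 8 from the Series are kept.
with_index = pd.DataFrame([x for x in flagInfoSeries], index=flagInfoSeries.index)
print(with_index.index.tolist())  # [2, 8]
```

The list comprehension only sees the values of the Series, so the original labels have to be passed in explicitly.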