Tags: python, pandas, dataframe, dictionary, generator

Return generator instead of list from df.to_dict()


I am working on a large Pandas DataFrame which needs to be converted into dictionaries before being processed by another API.

The required dictionaries can be generated by calling the .to_dict(orient='records') method. As stated in the docs, the returned value depends on the orient option:

Returns: dict, list or collections.abc.Mapping

Return a collections.abc.Mapping object representing the DataFrame. The resulting transformation depends on the orient parameter.

In my case, passing orient='records' returns a list of dictionaries. A list holds all of its items at once, so memory for every row dictionary is allocated up front. As my DataFrame can get rather large, this might lead to memory issues, especially as the code might be executed on lower-spec target systems.

I could certainly circumvent this issue by processing the DataFrame chunk-wise and generating the list of dictionaries for each chunk, which is then passed to the API. Furthermore, calling iter(df.to_dict(orient='records')) would return an iterator, but it would not reduce the memory footprint, because the complete list is still created first.
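For reference, the chunk-wise workaround could look roughly like the sketch below. The names records_in_chunks and chunk_size are hypothetical and only serve the illustration; each chunk is still materialized as a list, so peak memory is bounded by the chunk size rather than by the whole DataFrame.

```python
import pandas as pd


def records_in_chunks(df, chunk_size):
    """Yield lists of record dicts, one list per slice of `chunk_size` rows."""
    for start in range(0, len(df), chunk_size):
        # Positional slicing; only this slice is converted to dictionaries
        chunk = df.iloc[start:start + chunk_size]
        yield chunk.to_dict(orient="records")


df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
chunks = list(records_in_chunks(df, chunk_size=2))
```

Here the three rows are split into one chunk of two records and one chunk of one record.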

Is there a way to get a generator directly from df.to_dict(orient='records') instead of a list, in order to reduce the memory footprint?


Solution

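  • There is no way to get a generator directly from to_dict(orient='records'). However, the relevant part of the to_dict source code can be adapted into a generator function that yields one dictionary per row instead of building the whole list with a list comprehension. Note that standardize_mapping and maybe_box_native are imported from pandas internals (pandas.core), so they may change between pandas versions: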

    from pandas.core.common import standardize_mapping
    from pandas.core.dtypes.cast import maybe_box_native
    
    
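    def dataframe_records_gen(df_):
        """Yield one dictionary per row instead of building the full list."""
        columns = df_.columns.tolist()
        into_c = standardize_mapping(dict)
    
        # With name=None, itertuples yields plain tuples, one row at a time
        for row in df_.itertuples(index=False, name=None):
            # Box values to native Python types, mirroring what to_dict does
            yield into_c((k, maybe_box_native(v)) for k, v in zip(columns, row))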
    

    Sample Code:

    import pandas as pd
    
    df = pd.DataFrame({
        'A': [1, 2],
        'B': [3, 4]
    })
    
    # Using Generator
    for row in dataframe_records_gen(df):
        print(row)
    
    # For Comparison with to_dict function
    print("to_dict", df.to_dict(orient='records'))
    

    Output:

    {'A': 1, 'B': 3}
    {'A': 2, 'B': 4}
    to_dict [{'A': 1, 'B': 3}, {'A': 2, 'B': 4}]
    

    For more natural syntax, it's also possible to register a custom accessor:

    import pandas as pd
    from pandas.core.common import standardize_mapping
    from pandas.core.dtypes.cast import maybe_box_native
    
    
    @pd.api.extensions.register_dataframe_accessor("gen")
    class GenAccessor:
        def __init__(self, pandas_obj):
            self._obj = pandas_obj
    
        def records(self):
            """Yield one dictionary per row of the wrapped DataFrame."""
            columns = self._obj.columns.tolist()
            into_c = standardize_mapping(dict)
    
            for row in self._obj.itertuples(index=False, name=None):
                # Box values to native Python types, mirroring what to_dict does
                yield into_c(
                    (k, maybe_box_native(v)) for k, v in zip(columns, row)
                )
    

    This makes the generator available through the gen accessor on any DataFrame:

    df = pd.DataFrame({
        'A': [1, 2],
        'B': [3, 4]
    })
    
    # Using Generator through registered custom accessor
    for row in df.gen.records():
        print(row)
    
    # For Comparison with to_dict function
    print("to_dict", df.to_dict(orient='records'))
    

    Output:

    {'A': 1, 'B': 3}
    {'A': 2, 'B': 4}
    to_dict [{'A': 1, 'B': 3}, {'A': 2, 'B': 4}]
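
If depending on pandas internals is a concern, a variant that uses only public API might look like the sketch below. It assumes the consumer does not need the extra native-type boxing that maybe_box_native performs, and the names iter_records and batched are hypothetical helpers introduced for this illustration. The batching helper shows one way to hand the records to an API in small groups without ever building the full list.

```python
from itertools import islice

import pandas as pd


def iter_records(df):
    """Yield one dict per row using only public pandas API.

    Assumption: no additional native-type boxing is required downstream.
    """
    columns = df.columns.tolist()
    for row in df.itertuples(index=False, name=None):
        yield dict(zip(columns, row))


def batched(iterable, size):
    """Yield lists of at most `size` items drawn lazily from `iterable`."""
    it = iter(iterable)
    # islice pulls only `size` items at a time; an empty list ends the loop
    while batch := list(islice(it, size)):
        yield batch


df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
batches = list(batched(iter_records(df), 2))
```

Only one batch of dictionaries exists at a time while iterating over batched(...); the final list(...) call is there purely to make the result easy to inspect.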