Tags: python, pandas, dictionary, dataframe, sparse-matrix

Pandas DataFrame.from_dict() poor performance when generating from a lengthy dict of dicts


In my Python application, I find it handy to use a dictionary of dictionaries as the source data for constructing a sparse pandas DataFrame, which I then use to train a model in sklearn.

The structure of the dictionary is like this:

data = {"X": {'a': 1, 'b': 2, 'c': 3}, "Y": {'d': 4, 'e': 5, 'f': 6}, "Z": {'g': 7, 'h': 8, 'i': 9}}

Ideally, I'd like to turn it into a DataFrame like this:

df = pandas.DataFrame.from_dict(data, orient="index").fillna(0).astype(int)

Which generates this:

   e  d  f  a  c  b  i  h  g
X  0  0  0  1  3  2  0  0  0
Y  5  4  6  0  0  0  0  0  0
Z  0  0  0  0  0  0  9  8  7

Now, here's my problem. My data has a number of rows in the hundreds of thousands (ie, the number of keys in the outer dictionary). Each one of these has only a handful of columns associated with it (ie, the number of keys in each inner dictionary), but the total number of columns numbers in the thousands. I've found DataFrame generation using from_dict to be very slow, on the order of 2.5-3 minutes for 200,000 rows and 6,000 columns.

Furthermore, when the row index is a MultiIndex (ie, instead of X, Y and Z the keys of the outer dictionary are tuples), from_dict is even slower, on the order of 7+ minutes for 200,000 rows. I've found that this overhead can be avoided if, instead of a dictionary of dictionaries, one uses a list of dictionaries and then adds the MultiIndex back to the resulting DataFrame using set_index.
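A minimal sketch of that workaround (the tuple keys and column names here are made up for illustration):

```python
import pandas

# Hypothetical example: outer keys are (group, item) tuples,
# which would otherwise trigger from_dict's slow MultiIndex path.
data = {
    ("A", "X"): {"a": 1, "b": 2},
    ("A", "Y"): {"c": 3},
    ("B", "Z"): {"d": 4},
}

# Build from a plain list of the inner dicts, then attach the
# MultiIndex afterwards instead of letting from_dict infer it.
df = pandas.DataFrame(list(data.values())).fillna(0).astype(int)
df.index = pandas.MultiIndex.from_tuples(data.keys())
```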

In summary, how would you suggest I deal with this? Performance with the MultiIndex can clearly be improved by the library developers, but am I using the wrong tool for the job here? If written to disk, the DataFrame is around 2.5GB in size. Reading a 2.5GB file from disk in around 2 or so minutes seems about right, but the sparsity of my data in memory should theoretically allow this to be much faster.


Solution

  • It turns out that sklearn has a class that does exactly what I need.

    sklearn.feature_extraction.DictVectorizer

    I generate the data as a list of dictionaries, setting the row labels aside. And then:

    import numpy
    import pandas
    import sklearn.feature_extraction

    vectorizer = sklearn.feature_extraction.DictVectorizer(dtype=numpy.uint8,
                                                           sparse=False)

    matrix = vectorizer.fit_transform(data)
    # get_feature_names() was removed in scikit-learn 1.2;
    # on newer versions use get_feature_names_out() instead.
    column_labels = vectorizer.get_feature_names_out()

    df = pandas.DataFrame(matrix, index=row_labels, columns=column_labels)
    

    That finishes within a minute or so, which is fast enough for me. Maybe somebody can improve it further.