
How to efficiently create a SparseDataFrame from a long table?


I have a SQL table which I can read in as a Pandas data frame, that has the following structure:

user_id    value
1          100
1          200
2          100
4          200

It's a representation of a matrix, for which all the values are 1 or 0. The dense representation of this matrix would look like this:

    100  200
1   1    1
2   1    0
4   0    1

Normally you would use pivot for this conversion, but with tens or hundreds of millions of rows in the long table you get a big dense matrix full of zeros, which is expensive to drag around. You can convert it to sparse afterwards, but getting that far requires a lot of memory.
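For concreteness, the dense route might look like this (a minimal sketch of the pivot approach on the sample data, not code from the question):

```python
import pandas as pd

df = pd.DataFrame({'user_id': [1, 1, 2, 4], 'value': [100, 200, 100, 200]})

# Mark each (user, value) pair, then pivot into a dense 0/1 matrix
dense = (
    df.assign(flag=1)
      .pivot_table(index='user_id', columns='value', values='flag', fill_value=0)
)
# Every missing (user, value) cell becomes an explicitly stored zero,
# which is what blows up memory at hundreds of millions of rows.
```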

Right now I'm working on a solution that assigns a row number to each user_id, sorts, and then splits the 'value' column into SparseSeries before recombining them into a SparseDataFrame. Is there a better way?
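The row-numbering step described above can be sketched with pandas.factorize, which maps each distinct user_id to a consecutive integer (a minimal illustration, not the asker's actual code):

```python
import pandas as pd

df = pd.DataFrame({'user_id': [1, 1, 2, 4], 'value': [100, 200, 100, 200]})

# factorize assigns each distinct user_id a consecutive row number;
# sort=True makes the numbering follow sorted user_id order
row_idx, unique_users = pd.factorize(df['user_id'], sort=True)
```

Here `row_idx` holds the row number for every record and `unique_users` the user_id that owns each row.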


Solution

  • I arrived at a solution, albeit a slightly imperfect one.

    What one can do is manually create a Pandas SparseSeries for each column, combine them into a dict, and then cast that dict to a DataFrame (not a SparseDataFrame). Casting to SparseDataFrame currently hits an immature constructor, which deconstructs the whole object into dense form and back into sparse regardless of the input. Building the SparseSeries into a conventional DataFrame, however, maintains sparsity while producing a viable, otherwise complete DataFrame object.

    Here's a demonstration of how to do it, written more for clarity than for performance. One difference from my own implementation is that I created the dict of sparse vectors with a dict comprehension instead of a loop.

    import pandas
    import numpy
    
    df = pandas.DataFrame({'user_id':[1,2,1,4],'value':[100,100,200,200]})
    
    # Get the sorted unique users and unique features
    unique_users = numpy.sort(df['user_id'].unique())
    unique_features = numpy.sort(df['value'].unique())
    num_rows = len(unique_users)
    num_features = len(unique_features)
    
    
    # assign each user_id to a row_number
    user_lookup = pandas.DataFrame({'uid':range(num_rows), 'user_id':unique_users})
    
    
    vec_dict = {}
    
    # Create a sparse vector for each feature
    for i in range(num_features):
        users_with_feature = df[df['value']==unique_features[i]]['user_id']
    
        uid_rows = user_lookup[user_lookup['user_id'].isin(users_with_feature)]['uid']
    
        vec = numpy.zeros(num_rows)
        vec[uid_rows] = 1
    
        sparse_vec = pandas.Series(vec).to_sparse(fill_value=0)
    
        vec_dict[unique_features[i]] = sparse_vec
    
    
    my_pandas_frame = pandas.DataFrame(vec_dict)    
    my_pandas_frame = my_pandas_frame.set_index(user_lookup['user_id']) 
    

    The results:

    >>> my_pandas_frame
             100  200
    user_id          
    1          1    1
    2          1    0
    4          0    1
    
    >>> type(my_pandas_frame)
    <class 'pandas.core.frame.DataFrame'>
    
    >>> type(my_pandas_frame[100])
    <class 'pandas.sparse.series.SparseSeries'>
    

    Complete, but still sparse. One caveat: if you do a simple copy, or take a subset not in place, the frame will forget itself and recast to dense, but for my purposes I'm pretty happy with it.
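
    On newer pandas releases, where SparseSeries, SparseDataFrame, and to_sparse have since been removed, the same column-by-column trick can be sketched with pandas.arrays.SparseArray (a hedged adaptation of the answer above, not part of the original code):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'user_id': [1, 2, 1, 4], 'value': [100, 100, 200, 200]})

unique_users = np.sort(df['user_id'].unique())
unique_features = np.sort(df['value'].unique())

# One sparse 0/1 column per feature, built with a dict comprehension;
# zeros are the fill value, so they are not stored explicitly
vec_dict = {
    feature: pd.arrays.SparseArray(
        np.isin(unique_users, df.loc[df['value'] == feature, 'user_id']).astype('int64'),
        fill_value=0,
    )
    for feature in unique_features
}

sparse_frame = pd.DataFrame(vec_dict, index=pd.Index(unique_users, name='user_id'))
```

    As before, the result is a regular DataFrame whose columns happen to carry a sparse dtype, so the per-cell values match the dense pivot while only the ones are stored.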