
How to efficiently create a SparseDataFrame from a long table?


I have a SQL table which I can read in as a Pandas data frame, that has the following structure:

user_id    value
1          100
1          200
2          100
4          200

It's a representation of a matrix, for which all the values are 1 or 0. The dense representation of this matrix would look like this:

    100  200
1   1    1
2   1    0
4   0    1

Normally you would use pivot for this conversion, but with tens or hundreds of millions of rows in the long table you get a big dense matrix full of zeros, which is expensive to drag around. You can convert it to sparse afterwards, but getting that far requires a lot of memory.
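For concreteness, the dense route might look like this (a minimal sketch of the pivot approach on the sample data, not code from the question):

```python
import pandas as pd

df = pd.DataFrame({'user_id': [1, 1, 2, 4], 'value': [100, 200, 100, 200]})

# Mark each (user, value) pair, then pivot into a dense 0/1 matrix
dense = (
    df.assign(flag=1)
      .pivot_table(index='user_id', columns='value', values='flag', fill_value=0)
)
# Every missing (user, value) cell becomes an explicitly stored zero,
# which is what blows up memory at hundreds of millions of rows.
```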

Right now I'm working on a solution that assigns a row number to each user_id, sorts, and then splits the 'value' column into SparseSeries before recombining them into a SparseDataFrame. Is there a better way?
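The row-numbering step described above can be sketched with pandas.factorize, which maps each distinct user_id to a consecutive integer (a minimal illustration, not the asker's actual code):

```python
import pandas as pd

df = pd.DataFrame({'user_id': [1, 1, 2, 4], 'value': [100, 200, 100, 200]})

# factorize assigns each distinct user_id a consecutive row number;
# sort=True makes the numbering follow sorted user_id order
row_idx, unique_users = pd.factorize(df['user_id'], sort=True)
```

Here `row_idx` holds the row number for every record and `unique_users` the user_id that owns each row.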


Solution

  • I arrived at a solution, albeit a slightly imperfect one.

    What one can do is manually create a Pandas SparseSeries for each column, combine them into a dict, and then cast that dict to a DataFrame (not a SparseDataFrame). Casting to SparseDataFrame currently hits an immature constructor, which deconstructs the whole object into dense form and back into sparse regardless of the input. Building the SparseSeries into a conventional DataFrame, however, maintains sparsity while producing a viable, otherwise complete DataFrame object.

    Here's a demonstration of how to do it, written more for clarity than for performance. One difference from my own implementation is that I created the dict of sparse vectors with a dict comprehension instead of a loop.

    import pandas
    import numpy
    
    df = pandas.DataFrame({'user_id':[1,2,1,4],'value':[100,100,200,200]})
    
    # Get the sorted unique users and unique features
    unique_users = numpy.sort(df['user_id'].unique())
    unique_features = numpy.sort(df['value'].unique())
    num_rows = len(unique_users)
    num_features = len(unique_features)
    
    
    # assign each user_id to a row_number
    user_lookup = pandas.DataFrame({'uid':range(num_rows), 'user_id':unique_users})
    
    
    vec_dict = {}
    
    # Create a sparse vector for each feature
    for i in range(num_features):
        users_with_feature = df[df['value']==unique_features[i]]['user_id']
    
        uid_rows = user_lookup[user_lookup['user_id'].isin(users_with_feature)]['uid']
    
        vec = numpy.zeros(num_rows)
        vec[uid_rows] = 1
    
        sparse_vec = pandas.Series(vec).to_sparse(fill_value=0)
    
        vec_dict[unique_features[i]] = sparse_vec
    
    
    my_pandas_frame = pandas.DataFrame(vec_dict)    
    my_pandas_frame = my_pandas_frame.set_index(user_lookup['user_id']) 
    

    The results:

    >>> my_pandas_frame
             100  200
    user_id          
    1          1    1
    2          1    0
    4          0    1
    
    >>> type(my_pandas_frame)
    <class 'pandas.core.frame.DataFrame'>
    
    >>> type(my_pandas_frame[100])
    <class 'pandas.sparse.series.SparseSeries'>
    

    Complete, but still sparse. One caveat: if you do a simple copy, or take a subset not in place, the frame will forget itself and recast to dense, but for my purposes I'm pretty happy with it.
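
    On newer pandas releases, where SparseSeries, SparseDataFrame, and to_sparse have since been removed, the same column-by-column trick can be sketched with pandas.arrays.SparseArray (a hedged adaptation of the answer above, not part of the original code):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'user_id': [1, 2, 1, 4], 'value': [100, 100, 200, 200]})

unique_users = np.sort(df['user_id'].unique())
unique_features = np.sort(df['value'].unique())

# One sparse 0/1 column per feature, built with a dict comprehension;
# zeros are the fill value, so they are not stored explicitly
vec_dict = {
    feature: pd.arrays.SparseArray(
        np.isin(unique_users, df.loc[df['value'] == feature, 'user_id']).astype('int64'),
        fill_value=0,
    )
    for feature in unique_features
}

sparse_frame = pd.DataFrame(vec_dict, index=pd.Index(unique_users, name='user_id'))
```

    As before, the result is a regular DataFrame whose columns happen to carry a sparse dtype, so the per-cell values match the dense pivot while only the ones are stored.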