python-3.x pandas sparse-dataframe

How to drop rows from a Sparse Dataframe without changing the format


I am trying to drop some empty rows in my dataframe. The following code shows that the datatypes are indeed sparse.

items_users_sparse_top_tags_df = items_users_sparse_pd.loc[tracks_tags_df.index]
items_users_sparse_top_tags_df.rename_axis('tracks', axis = 'index', inplace = True)
items_users_sparse_top_tags_df.dtypes

and the result:

playlists
37i9dQZF1DX7KNKjOK0o75    Sparse[int64, 0]
37i9dQZF1DWT1y71ZcMPe5    Sparse[int64, 0]
37i9dQZF1DX1tyCD9QhIWF    Sparse[int64, 0]
37i9dQZF1DWSXBu5naYCM9    Sparse[int64, 0]
3JwPVKISB9IBlE2RST1MVn    Sparse[int64, 0]
                                ...
0lDMDuxqUYRAHAg2aSB4Mh    Sparse[int64, 0]
6JX1W7EUwl28ApynqRIzGd    Sparse[int64, 0]
73pA7uClVdMP4UM4NHYkjw    Sparse[int64, 0]
7rRuBmh62FSsGh7ymtIUl3    Sparse[int64, 0]
2moEpTGsu9XpWjc7DMCgH6    Sparse[int64, 0]
Length: 3990, dtype: object

When I try to remove the empty users (which become rows after the transpose), the dtype changes. The code:

users_items_sparse_dropped = items_users_sparse_top_tags_df.T[(items_users_sparse_top_tags_df != 0).any()]

the dtypes:

tracks
2res3Ptlahsu1kh5XtFhu4    object
4UGxnxGlpc7BB8Cbu8vITC    object
63diy8Bzm0pHMAU37By2Nh    object
6wBHYoPsAqS88OwfjCvlaq    object
1aoaegj0Bv8p1N6dWyCDbr    object
                           ...  
2IH4PRZxA3W6sIWcFU0GKZ    object
2JKlf0IYz5oWsT3OCLyjpO    object
0fa2P8krhE1K19MUUh0meb    object
2CM7CAL7aJ5WkPU0oGbA96    object
0w2U0uERbUTJMNIKdTSUkj    object
Length: 15679, dtype: object

While the code indeed removes the empty users-as-rows, I would prefer to keep the dataframe sparse so I do not have to transform it again.

The reason for using sparse dataframes rather than scipy sparse formats directly is to keep the IDs as indexes, so that they do not get mixed up during data manipulation.


Solution

  • Answering my own question: the issue was an incompatibility between the int64 of the non-empty values and the NaN of the empty values, because NaN values are treated as floats.

    When I transposed the matrix, the dtypes changed from Sparse[int64, 0] to object (dtype: O).

    There are a few workarounds:

    1) Cast the dataframe to float using astype.
    2) If you really want to preserve the sparse int64 format, create a new sparse dtype with pd.SparseDtype(int, fill_value = np.nan) and cast to it using astype after the dataframe manipulation.

    Lastly, as far as I have tried, similar restrictions apply to numpy sparse formats.

    P.S. An interesting find: https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html
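
The "cast with astype after the manipulation" idea can be sketched roughly as follows. This is only an illustration on made-up toy data: the track and playlist labels and the names `sparse_int`, `items_users`, `keep` and `users_items` are hypothetical, and it re-casts to the original Sparse[int64, 0] dtype from the question rather than to a NaN-filled dtype. Whether the transpose itself keeps the sparse dtype may depend on the pandas version, so the final astype is there as a safety net rather than as a guaranteed necessity.

```python
import pandas as pd

# Hypothetical toy data: 3 tracks (rows) x 4 playlists (columns).
dense = pd.DataFrame(
    [[1, 0, 0, 2],
     [0, 0, 0, 3],
     [4, 0, 0, 0]],
    index=pd.Index(["t1", "t2", "t3"], name="tracks"),
    columns=pd.Index(["p1", "p2", "p3", "p4"], name="playlists"),
)

# Same sparse dtype as in the question: int64 values, fill value 0.
sparse_int = pd.SparseDtype("int64", 0)
items_users = dense.astype(sparse_int)

# Boolean mask per playlist: True if the column has any non-zero entry.
keep = (items_users != 0).any().to_numpy(dtype=bool)

# Drop the empty playlists while they are still columns; selecting
# columns leaves each remaining column's dtype untouched.
non_empty = items_users.loc[:, keep]

# Transpose so playlists become rows, then cast back to the sparse
# dtype in case the transpose changed it.
users_items = non_empty.T.astype(sparse_int)

print(users_items.dtypes)
```

Filtering before the transpose means the row labels (playlist IDs) and column labels (track IDs) are carried along by pandas, which is the reason given above for not dropping down to scipy sparse matrices.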