python, r, sparse-matrix

Pros and cons of using sparse matrices in Python/R?


I'm working with large, sparse matrices (document-feature matrices generated from text) in Python. It's taking quite a bit of processing time and memory to chew through these, and I imagine that sparse matrices could offer some improvements. But I'm worried that using a sparse matrix library is going to make it harder to plug into other Python (and R, through rpy2) modules.

Can people who've crossed this bridge already offer some advice? What are the pros and cons of using sparse matrices in Python/R, in terms of performance, scalability, and compatibility?


Solution

  • Implementing sparse matrices in pure Python might not be a great idea in itself. Have you checked out the sparse matrices in NumPy / SciPy (scipy.sparse)?

    NumPy's big benefit is that its core is mostly C code, which gives large performance gains over pure Python.

    From my limited experience of doing text processing in R, the performance makes it pretty much unusable for anything beyond exploratory data analysis.

    Regardless, you shouldn't be using vanilla lists for sparse matrices; it will (understandably) take a while to chew through them.
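
To make the scipy.sparse suggestion concrete, here is a minimal sketch that builds a small document-feature matrix in CSR format instead of nested lists. The toy corpus, the whitespace tokenizer, and every variable name are assumptions made up for illustration, not something from the question; it only assumes NumPy and SciPy are installed.

```python
import numpy as np
from scipy import sparse

# Toy corpus (hypothetical); a real one would come from your text pipeline.
docs = [
    "sparse matrices save memory",
    "dense matrices waste memory on zeros",
    "text data is mostly zeros",
]

# Assign each distinct token a column index and record one
# (row, col) pair per token occurrence.
vocab = {}
rows, cols = [], []
for doc_id, doc in enumerate(docs):
    for token in doc.split():
        col = vocab.setdefault(token, len(vocab))
        rows.append(doc_id)
        cols.append(col)

# Build from (value, (row, col)) triplets. Repeated (row, col) pairs
# are summed, so repeated tokens turn into counts.
counts = np.ones(len(rows), dtype=np.int64)
dtm = sparse.csr_matrix((counts, (rows, cols)), shape=(len(docs), len(vocab)))

# Only the non-zero entries and their index arrays are stored.
sparse_bytes = dtm.data.nbytes + dtm.indices.nbytes + dtm.indptr.nbytes
dense_bytes = dtm.toarray().nbytes

# Typical operations run without converting to a dense array.
term_totals = np.asarray(dtm.sum(axis=0)).ravel()  # total count per term
doc_similarity = dtm @ dtm.T                       # still a sparse matrix
```

On a matrix this small the memory difference is negligible; the savings only matter once the matrix is large and mostly zeros, which is the usual case for document-feature data. When another library insists on a dense array, `dtm.toarray()` converts it, at the cost of materializing every zero.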