Tags: python, r, feather

Conversion to FEATHER file creates huge file


I am trying to turn an .rds file into a .feather file for reading with Pandas in Python.

library(feather)

# Read the .rds file (assumed to be in the working directory) and pick one year
data = readRDS("file.rds")
data_year = data[["1986"]]

# Try 1
write_feather(
  data_year,
  "data_year.feather"
  )

# Try 2
write_feather(
  as.data.frame(as.matrix(data_year)),
  "data_year.feather"
)

Try 1 returns Error: 'x' must be a data frame. Try 2 does write a *.feather file, but that file is 4.5 GB for a single year, whereas the original *.rds file is 0.055 GB for several years.

How can I convert the data to *.feather, either as one file per year or as a single combined file, while keeping the file size reasonable?

data looks like this:

[screenshot not available]

data_year looks like this:

[screenshot not available]

Update

I am open to any suggestions for making the data available for use in NumPy/Pandas whilst maintaining a modest file size!


Solution

  • With scipy and rpy2, you can read each dgCMatrix object directly into Python as a scipy.sparse.csc_matrix object. Both classes use the compressed sparse column (CSC) format with zero-based indices, so no preprocessing is needed: pass the x, i, p and Dim slots of the dgCMatrix object to the csc_matrix constructor.
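
To make the CSC layout concrete, here is a minimal sketch that builds a 6 x 6 matrix from three hand-written arrays, without rpy2. The values are typed in by hand to mirror the 1986 example matrix used in this answer; they are an illustration, not output read from an RDS file.

```python
import numpy as np
from scipy import sparse

# Hand-written CSC arrays (illustrative values, not read from R).
# data:    the nonzero values, stored column by column  (R slot: x)
# indices: zero-based row index of each value           (R slot: i)
# indptr:  where each column starts inside data         (R slot: p)
data = np.array([3.0, 2.0, 5.0, 4.0, 1.0, 6.0])
indices = np.array([1, 3, 0, 5, 4, 2])
indptr = np.array([0, 1, 2, 3, 4, 5, 6])

m = sparse.csc_matrix((data, indices, indptr), shape=(6, 6))

# Densify only to inspect this tiny example; avoid this on large matrices.
dense = m.toarray()
```

Column j holds the entries data[indptr[j]:indptr[j + 1]], so only the six nonzero values and their positions are stored.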

    To test it out, I used R to create an RDS file storing a list of dgCMatrix objects:

    library("Matrix")
    set.seed(1L)
    
    d <- 6L
    n <- 10L
    l <- replicate(n, sparseMatrix(i = sample(d), j = sample(d), x = sample(d), repr = "C"), simplify = FALSE)
    names(l) <- as.character(seq(1986L, length.out = n))
    
    l[["1986"]]
    ## 6 x 6 sparse Matrix of class "dgCMatrix"
    ##                 
    ## [1,] . . 5 . . .
    ## [2,] 3 . . . . .
    ## [3,] . . . . . 6
    ## [4,] . 2 . . . .
    ## [5,] . . . . 1 .
    ## [6,] . . . 4 . .
    
    saveRDS(l, file = "list_of_dgCMatrix.rds")
    

    Then, in Python:

    from scipy import sparse
    from rpy2 import robjects
    readRDS = robjects.r['readRDS']
    
    l = readRDS('list_of_dgCMatrix.rds')
    x = l.rx2('1986') # in R: l[["1986"]]
    x
    ## <rpy2.robjects.methods.RS4 object at 0x120db7b00> [RTYPES.S4SXP]
    ## R classes: ('dgCMatrix',)
    
    print(x)
    ## 6 x 6 sparse Matrix of class "dgCMatrix"
    ##                 
    ## [1,] . . 5 . . .
    ## [2,] 3 . . . . .
    ## [3,] . . . . . 6
    ## [4,] . 2 . . . .
    ## [5,] . . . . 1 .
    ## [6,] . . . 4 . .
    
    data    = x.do_slot('x')   # in R: x@x
    indices = x.do_slot('i')   # in R: x@i
    indptr  = x.do_slot('p')   # in R: x@p
    shape   = x.do_slot('Dim') # in R: x@Dim or dim(x)
    
    y = sparse.csc_matrix((data, indices, indptr), tuple(shape))
    y
    ## <6x6 sparse matrix of type '<class 'numpy.float64'>'
    ##         with 6 stored elements in Compressed Sparse Column format>
    
    print(y)
    ##   (1, 0)       3.0
    ##   (3, 1)       2.0
    ##   (0, 2)       5.0
    ##   (5, 3)       4.0
    ##   (4, 4)       1.0
    ##   (2, 5)       6.0
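
The slot extraction above can be wrapped in a small helper and applied to every year in the list. This is a minimal sketch, assuming each element exposes do_slot like an rpy2 S4 object and the list exposes names and rx2 like an rpy2 list vector. The names dgc_to_csc and convert_all are hypothetical and not part of rpy2 or scipy.

```python
import numpy as np
from scipy import sparse


def dgc_to_csc(x):
    """Build a csc_matrix from an object exposing dgCMatrix slots via do_slot().

    Hypothetical helper: it assumes x behaves like an rpy2 S4 object
    wrapping a dgCMatrix (slots 'x', 'i', 'p' and 'Dim').
    """
    data = np.asarray(x.do_slot("x"), dtype=np.float64)
    indices = np.asarray(x.do_slot("i"), dtype=np.int32)
    indptr = np.asarray(x.do_slot("p"), dtype=np.int32)
    n_rows, n_cols = (int(d) for d in x.do_slot("Dim"))
    return sparse.csc_matrix((data, indices, indptr), shape=(n_rows, n_cols))


def convert_all(r_list):
    """Return {name: csc_matrix} for every named element of r_list.

    Assumes r_list has a .names attribute and an .rx2(name) method,
    as an rpy2 list vector does.
    """
    return {str(name): dgc_to_csc(r_list.rx2(name)) for name in r_list.names}
```

With the objects from the session above, convert_all(l)["1986"] should give the same matrix as y.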
    

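    Here, y is an object of class scipy.sparse.csc_matrix. You should not need to call the toarray method to coerce y to an array with dense storage. Densifying with as.matrix is almost certainly what made the Feather file in the question so large, because every zero is then stored explicitly. scipy.sparse implements the common matrix operations directly on the sparse representation. For example, here are the row and column sums of y: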

    y.sum(1) # in R: as.matrix(rowSums(x))
    ## matrix([[5.],
    ##         [3.],
    ##         [6.],
    ##         [2.],
    ##         [1.],
    ##         [4.]])
    
    y.sum(0) # in R: t(as.matrix(colSums(x)))
    ## matrix([[3., 2., 5., 4., 1., 6.]])
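
To keep the data on disk at a modest size, each year's matrix can be written in scipy's own compressed .npz format rather than Feather. The following is a sketch under stated assumptions: save_by_year is a hypothetical helper, and the demo matrix is random toy data standing in for a real year.

```python
import os
import tempfile

from scipy import sparse


def save_by_year(matrices, out_dir):
    """Write each sparse matrix to <out_dir>/<year>.npz and return the paths.

    Hypothetical helper; matrices is a dict mapping year -> scipy sparse matrix.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = {}
    for year, mat in matrices.items():
        path = os.path.join(out_dir, f"{year}.npz")
        sparse.save_npz(path, mat, compressed=True)
        paths[year] = path
    return paths


# Toy stand-in for one year of data: 1000 x 1000 with 0.1% nonzero entries.
demo = {"1986": sparse.random(1000, 1000, density=0.001, format="csc", random_state=0)}

with tempfile.TemporaryDirectory() as tmp:
    paths = save_by_year(demo, tmp)
    restored = sparse.load_npz(paths["1986"])
    # The round trip should preserve every stored value.
    round_trip_ok = (restored != demo["1986"]).nnz == 0
    sparse_bytes = os.path.getsize(paths["1986"])

# Size of the same matrix stored densely as float64, for comparison.
dense_bytes = demo["1986"].shape[0] * demo["1986"].shape[1] * 8
```

Files written this way can be read back with scipy.sparse.load_npz in any later Python session. If a pandas object is needed, pandas.DataFrame.sparse.from_spmatrix accepts a scipy sparse matrix and keeps the columns sparse.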