python-2.7 pandas rpy2

How to convert an rpy2 matrix object into a Pandas data frame?


After reading in a .csv file using pandas, and then converting it into an R dataframe using the rpy2 package, I created a model using some R functions (also via rpy2), and now want to take the summary of the model and convert it into a Pandas dataframe (so that I can either save it as a .csv file or use it for other purposes).

I followed the instructions on the pandas site (source: https://pandas.pydata.org/pandas-docs/stable/r_interface.html) to figure it out:

import pandas as pd
import rpy2.robjects.packages as rpackages
from rpy2.robjects import r, pandas2ri
from rpy2.robjects.vectors import StrVector

pandas2ri.activate()
#base and stats are needed further down for summary() and coef()
base = rpackages.importr('base')
stats = rpackages.importr('stats')
caret = rpackages.importr('caret')
broom = rpackages.importr('broom')

my_data= pd.read_csv("my_data.csv")
r_dataframe= pandas2ri.py2ri(my_data)

preprocessing= ["center", "scale"]
center_scale= StrVector(preprocessing)

#these are the columns in my data frame that will consist of my predictors in the model
predictors= ['predictor1','predictor2','predictor3']
predictors_vector= StrVector(predictors)

#this column from the dataframe consists of the outcome of the model
outcome= ['fluorescence']
outcome_vector= StrVector(outcome)

#this line extracts the columns of the predictors from the dataframe
columns_predictors= r_dataframe.rx(True, predictors_vector)

#this line extracts the column of the outcome from the dataframe
column_response= r_dataframe.rx(True, outcome_vector)

cvCtrl = caret.trainControl(method = "repeatedcv", number= 20, repeats = 100)

model_R= caret.train(columns_predictors, column_response, method = "glmStepAIC", preProc = center_scale, trControl = cvCtrl)

summary_model= base.summary(model_R)

coefficients= stats.coef(summary_model)

pd_dataframe = pandas2ri.ri2py(coefficients)

pd_dataframe.to_csv("coefficients.csv")

Although this workflow seems correct, the output .csv file did not meet my needs: the row and column names were gone. Running type(pd_dataframe) shows that it is a <type 'numpy.ndarray'>. The table values are still present, but the conversion dropped the names.

So I ran type(coefficients) and found that it is a <class 'rpy2.robjects.vectors.Matrix'>. Since this Matrix object still has my row and column names, I tried to convert it into an rpy2 robjects DataFrame, but without success. I also don't understand why the line pd_dataframe = pandas2ri.ri2py(coefficients) did not yield a pandas DataFrame, nor why it did not keep the row and column names.

Can anybody recommend an approach so I can get some kind of pandas DataFrame that retains the names of my columns and rows?

UPDATE

A method called pandas2ri.ri2py_dataframe is mentioned in the documentation for a slightly older version of the package (source: https://rpy2.readthedocs.io/en/version_2.7.x/changes.html), and with it I now get a proper data frame instead of the numpy array. However, the row and column names are still not transferred properly. Any suggestions?


Solution

  • Maybe it should happen automatically during conversion, but in the meantime the row and column names can easily be obtained from the R object and added to the pandas DataFrame. For example, the column names of the R matrix should be available as documented at: https://rpy2.github.io/doc/v2.9.x/html/vector.html#rpy2.robjects.vectors.Matrix.colnames
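
A minimal sketch of that suggestion, in case it helps. The helper name to_named_frame is made up for illustration, and the demo values are invented so the snippet runs without R installed. It assumes the names can be read from the Matrix object's rownames and colnames attributes and that missing names arrive as None or as an empty sequence; I have not checked this against every rpy2 version.

```python
import numpy as np
import pandas as pd


def to_named_frame(values, rownames=None, colnames=None):
    """Build a DataFrame from a 2-D array plus optional row/column labels.

    Hypothetical helper: `values` is the array obtained from the conversion,
    `rownames` and `colnames` are the labels read from the R matrix.
    """
    array = np.asarray(values)
    # Assumption: missing names arrive as None or as an empty sequence;
    # in that case pandas' default integer labels are used.
    index = list(rownames) if rownames is not None and len(rownames) > 0 else None
    columns = list(colnames) if colnames is not None and len(colnames) > 0 else None
    return pd.DataFrame(array, index=index, columns=columns)


# Stand-in data shaped like a coefficient table (values are made up).
demo = to_named_frame(
    np.array([[1.5, 0.2], [-0.7, 0.1]]),
    rownames=["(Intercept)", "predictor1"],
    colnames=["Estimate", "Std. Error"],
)
print(demo)

# With the objects from the question, the call would presumably look like:
# pd_dataframe = to_named_frame(pandas2ri.ri2py(coefficients),
#                               coefficients.rownames, coefficients.colnames)
```

Since to_csv writes the index and the header by default, saving the resulting frame should keep both sets of names in the .csv file.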