Tags: python, r, ipython, jupyter, rpy2

rpy2 rmagic for ipython converting dashes to dots in dataframe column names


I am using rpy2 through the rmagic extension to interleave R code with Python 3 code in a Jupyter notebook. A simple code cell such as this:

%%R -i df -o df_out
df_out <- df

returns a data frame with some column names changed, e.g. CTB-102L5.4 becomes CTB.102L5.4. I think this is related to read.table or something similar (as per this answer). However, I didn't find a way to control this in the rmagic extension.

The only workaround I could think of is to change the column names before passing the data frame to R and to revert them once the data frame is back in Python, but I'd like to find a better solution.
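
The rename-and-restore workaround could be sketched roughly as below. The helper names (to_placeholder_names, restore_original_names) and the col_N placeholder scheme are hypothetical, chosen only for illustration. The sketch works on plain lists of names so that it runs without pandas or R.

```python
def to_placeholder_names(columns):
    """Return (safe_names, restore_map) for a list of column names."""
    # Placeholders such as "col_0" are already valid R names, so R's
    # name sanitization leaves them untouched.
    safe_names = ["col_{}".format(i) for i in range(len(columns))]
    restore_map = dict(zip(safe_names, columns))
    return safe_names, restore_map


def restore_original_names(columns, restore_map):
    """Map placeholders back; names created on the R side are kept as-is."""
    return [restore_map.get(name, name) for name in columns]


original = ["CTB-102L5.4", "sample id", "gene"]
safe, restore_map = to_placeholder_names(original)
# With pandas (assumed usage): set df.columns = safe before the %%R cell, then
# df_out.columns = restore_original_names(df_out.columns, restore_map) after it.
restored = restore_original_names(safe, restore_map)
```

Restoring through a mapping rather than by position means that columns added or reordered on the R side do not end up with the wrong names.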


Solution

  • Whenever the parameter -i <name> is used to "import" a Python object into R, conversion rules are applied (see here). The default converter ends up calling R's function data.frame, which sanitizes the column names (parameter check.names=TRUE by default, see https://www.rdocumentation.org/packages/base/versions/3.4.3/topics/data.frame) into syntactically valid names that can be used as symbols without quoting. In your example, an unquoted CTB-102L5.4 would otherwise be parsed as the expression CTB - 102L5.4.

    This default behaviour is not necessarily desirable in every situation, and a custom converter can be passed to the R magic %%R.

    The documentation contains a short introduction to writing custom conversion rules (https://rpy2.github.io/doc/v2.9.x/html/robjects_convert.html).

    Assuming that your input is a pandas DataFrame, you could proceed as follows:

    1- Implement a variant of py2ri_pandasdataframe that does not sanitize names. Ideally this would be done by just setting check.names to FALSE, although that is currently not possible (see https://bitbucket.org/rpy2/rpy2/issues/455/add-parameter-to-dataframe-to-allow).

    from rpy2.robjects import pandas2ri

    def my_py2ri_pandasdataframe(obj):
        # Reuse the default pandas conversion (it sanitizes the column names).
        res = pandas2ri.py2ri_pandasdataframe(obj)
        # Set the column names in `res` to the original column names in `obj`
        # (left as an exercise for the reader)
        return res
    

    2- Create a custom converter derived from the IPython converter.

    import pandas
    from rpy2.ipython import rmagic
    from rpy2.robjects.conversion import Converter, localconverter
    
    my_dataf_converter = Converter('my converter')
    my_dataf_converter.py2ri.register(pandas.DataFrame,
                                      my_py2ri_pandasdataframe)
    
    my_converter = rmagic.converter + my_dataf_converter
    

    3- Use %%R with --converter=my_converter.
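
To check ahead of time which columns would be affected by the name sanitization, a rough pure-Python approximation of the rules of R's make.names can help. This is only a sketch: the function names approx_make_names and changed_columns are hypothetical, it handles ASCII names only, and it ignores the de-duplication that check.names=TRUE also performs.

```python
import re

# Reserved words of the R parser; make.names appends a dot to these.
R_RESERVED = {
    "if", "else", "repeat", "while", "function", "for", "next", "break",
    "in", "TRUE", "FALSE", "NULL", "Inf", "NaN", "NA",
    "NA_integer_", "NA_real_", "NA_character_",
}


def approx_make_names(name):
    """Approximate how R would sanitize a single column name."""
    # Prepend "X" unless the name starts with a letter, or with a dot
    # that is not followed by a digit.
    if not re.match(r"[A-Za-z]|\.(?![0-9])", name):
        name = "X" + name
    # Replace every character that is not valid in an R name with a dot.
    name = re.sub(r"[^A-Za-z0-9._]", ".", name)
    if name in R_RESERVED:
        name += "."
    return name


def changed_columns(columns):
    """Return {original: sanitized} for the names that would be altered."""
    sanitized = {c: approx_make_names(c) for c in columns}
    return {c: s for c, s in sanitized.items() if s != c}


affected = changed_columns(["CTB-102L5.4", "gene"])
# affected == {"CTB-102L5.4": "CTB.102L5.4"}
```

If the result is empty for a given data frame, the default converter should leave its column names unchanged and no custom converter is needed.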