python r pandas dataframe reticulate

Why will reticulate convert a list of pandas dfs to a list of r dfs but not if using a dictionary or nested lists?


This is my first time using reticulate. I have 20 multi-page PDF tables I'm pulling data from using camelot in Python (they're not simple tables, so I need the more powerful table reader). For each PDF, camelot returns a TableList object containing one table per page. I'm able to loop over that list and convert the tables to pandas dataframes. Example of doing this with one of the PDFs:

tables2001 = camelot.read_pdf('2001.pdf', flavor='stream', pages='1-end')
df2001 = list()
for t in tables2001:
  df = t.df
  df2001.append(df)

I can then return to r, and rdf2001 <- py$df2001 gives me a list of r data.frames.

However, if I instead put the python list of dataframes into either a nested list or a dictionary containing lists, the r conversion no longer works, and the resulting nested list still contains pandas data.frames. An attempt to manually convert one of the dfs understandably gives this:

Error in as.data.frame.default(rdf2001_nested[[1]]) : 
  cannot coerce class ‘c("pandas.core.frame.DataFrame", "pandas.core.generic.NDFrame", ’ to a data.frame

If I pull a single list from a nested list into r, e.g. df2001_a <- py$df2001[1], that converts to a single list of r data.frames. I can't do the same for a dictionary, since the conversion keeps the key as a list so the nesting still exists.

The idea of using a dictionary was to get a named list in r identifying each year, since the tables themselves do not contain that information. I can work around it, but converting the dictionary to a named list would, to me, be the clearest way to do this, assuming it worked. I tried nested lists to find out whether the conversion issue only happens with dictionaries. It doesn't; it happens with any kind of nesting.

I'm trying to understand why this is happening. Can reticulate only convert a single level of a list? Is there an underlying reason for this or is it just that that ability hasn't been added but in theory could be?

Update with full code:

Pdf tables are here. I extracted the pages covering criminal caseloads for each year which is why pages are listed as 1-end; each has 14 pages. Python code run with repl_python() - works and gives the outcome I intend for both the list and dictionary:

import camelot
import pandas

# Lists
tables2001 = camelot.read_pdf('2001.pdf', flavor='stream', pages='1-end')
tables2002 = camelot.read_pdf('2002.pdf', flavor='stream', pages='1-end')
tables2003 = camelot.read_pdf('2003.pdf', flavor='stream', pages='1-end')

dflist = list()
tablelist = [tables2001, tables2002, tables2003]
# Each element is a TableList; loop over its tables to reach each page's dataframe
for tables in tablelist:
  for t in tables:
    dflist.append(t.df)
  
# Dictionary - I got help with this from someone who knows Python well
# range(1, 4) covers 2001-2003, matching the three years described below
tables = { f'20{str(n).zfill(2)}': camelot.read_pdf(f'20{str(n).zfill(2)}.pdf',
    flavor='stream', pages='1-end', table_regions=['50,580,780,50']) for n in range(1, 4)}

dfdict = { k: [df.df for df in v] for k, v in tables.items() }

R code:

library(reticulate)

# List
rdflist <- py$dflist

# Dictionary
rdfdict <- py$dfdict

rdflist is a list of data.frames. rdfdict is a named nested list, containing 3 lists (2001, 2002, 2003), each with 14 pandas dataframes, i.e. not usable in r.

class(rdflist[[1]])
[1] "data.frame"
class(rdfdict[[1]][[1]])
[1] "pandas.core.frame.DataFrame"        "pandas.core.generic.NDFrame"       
[3] "pandas.core.base.PandasObject"      "pandas.core.base.StringMixin"      
[5] "pandas.core.accessor.DirNamesMixin" "pandas.core.base.SelectionMixin"   
[7] "python.builtin.object"  

Attempt to coerce a single df to data.frame:

as.data.frame(rdfdict[[1]][[1]])
Error in as.data.frame.default(rdfdict[[1]][[1]]) : 
  cannot coerce class ‘c("pandas.core.frame.DataFrame", "pandas.core.generic.NDFrame", ’ to a data.frame

Solution

  • Comparing both versions, the dictionary version differs in a couple of ways: it passes an additional argument, table_regions, and it uses an extra nested loop inside the dictionary comprehension, [df.df for df in v] (which, interestingly, did not raise an error in Python).

    Consider making the two versions consistent so that their returned values are comparable. By the way, Python also supports list comprehensions, which work much like dict comprehensions.

    Python

    import camelot 
    import pandas as pd
    
    # LIST COMPREHENSION
    pydf_list = [
        [tbl.df for tbl in camelot.read_pdf(f'{yr}.pdf', flavor='stream', pages='1-end')]
        for yr in range(2001, 2004)
    ]
    
    # DICT COMPREHENSION
    pydf_dict = {
        str(yr): [tbl.df for tbl in camelot.read_pdf(f'{yr}.pdf', flavor='stream', pages='1-end')]
        for yr in range(2001, 2004)
    }
    

    R

    library(reticulate)
    
    reticulate::source_python("myscript.py")
    
    # NESTED LIST 
    rdf_list <- reticulate::py$pydf_list 
    
    # NESTED NAMED LIST 
    rdf_dict <- reticulate::py$pydf_dict
    

    However, as you indicate, I can reproduce the problematic conversion of a dict to a named list with a reproducible example. After the issue was reported, one suggestion from the maintainer is to apply py_to_r to each element:

    rdf_dict2 <- lapply(rdf_dict, function(lst) lapply(lst, py_to_r))
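
    Another possible workaround, given that the question reports a flat list of dataframes converting fine, is to flatten the dictionary on the Python side and carry the year labels in a parallel list. This is only a minimal sketch: flatten_by_year, flat_labels, flat_tables and the "<year>_p<page>" label format are hypothetical names chosen for illustration, and plain strings stand in for the pandas dataframes.

```python
def flatten_by_year(tables_by_year):
    """Flatten {year: [table, ...]} into two parallel flat lists.

    Returns (labels, tables), where labels[i] is "<year>_p<page>" for tables[i].
    """
    labels, tables = [], []
    for year, pages in tables_by_year.items():
        # Pages are numbered from 1 in the order they appear in each year's list
        for page_no, table in enumerate(pages, start=1):
            labels.append(f"{year}_p{page_no}")
            tables.append(table)
    return labels, tables


# Stand-in strings; in the real script each item would be a dataframe (tbl.df)
nested = {"2001": ["t1", "t2"], "2002": ["t3"]}
flat_labels, flat_tables = flatten_by_year(nested)
print(flat_labels)  # ['2001_p1', '2001_p2', '2002_p1']
print(flat_tables)  # ['t1', 't2', 't3']
```

    In R, assuming the flat list converts as described in the question, the labels could then be attached with something like names(rdfs) <- py$flat_labels after rdfs <- py$flat_tables.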