Tags: python, r, pandas, dataframe, smote

How to get proper value instead of "NA_character_" in pandas dataframe while calling R function from Python?


I'm calling an R function from a Python script to apply SMOTE to a dummy dataset in which the majority class is 0 (90%) and the minority class is 1 (10%). Calling the R function directly gives the proper output, but calling the same function from Python produces NA_character_ values. Below is the R function:

# file r_test.r
library(performanceEstimation)

rtest <- function(r_df, over_val, under_val) {
  set.seed(0)
  new_df <- smote(y ~ ., r_df, perc.over = over_val, perc.under = under_val,  k = 5)
  table(new_df$y)
  return(new_df)
}

Below is the Python code that calls this function:

import os
import numpy as np
import pandas as pd

import rpy2.robjects as ro
from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri
from rpy2.robjects.conversion import localconverter

from sklearn.datasets import make_classification

def function2(r_df, over_val, under_val):
    r=ro.r
    r.source(path)
    p=r.rtest(r_df, over_val, under_val)
    return p

path=os.path.join(os.getcwd(), "r_test.r")

X, y = make_classification(n_classes=2,
    class_sep=2, 
    weights=[0.90, 0.10], 
    n_informative=4, 
    n_redundant=1, 
    flip_y=0,
    n_features=5, 
    n_clusters_per_class=1,
    n_samples=100,
    random_state=10)

df = pd.DataFrame(X, columns = ["x1", "x2", "x3", "x4", "x5"])
df['y'] = y
df['y'].value_counts()

Output -

0    90
1    10
Name: y, dtype: int64

base = importr('base')

with localconverter(ro.default_converter + pandas2ri.converter):
    r_from_pd_df = ro.conversion.py2rpy(df)
    
with localconverter(ro.default_converter + pandas2ri.converter):
    pd_from_r_df = ro.conversion.rpy2py(function2(r_from_pd_df, 5, 2))

pd_from_r_df['y'].value_counts()

Output -

0                100
NA_character_     50
1                 10
Name: y, dtype: int64

The number of NA_character_ values is exactly the number of minority-class samples that this smote function should generate. What mistake am I making in the code above, and how can I get 1s instead of NA_character_? Note: I'm completely new to R. If there is a problem in the R code, please point it out with a complete example.
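
The expected class counts can be checked with a little arithmetic. This is only a sketch: it assumes perc.over acts as a multiplier on the minority class and perc.under as a multiplier on the number of generated cases, which is what the outputs in this post suggest. The helper name expected_smote_counts is made up for illustration and is not part of any library.

```python
def expected_smote_counts(n_minority, perc_over, perc_under):
    """Return (majority rows kept, minority rows in total).

    Assumed semantics (inferred from the counts in this post, not from
    the package source): perc_over * n_minority synthetic minority rows
    are generated, and perc_under majority rows are sampled per
    synthetic row.
    """
    n_synthetic = int(perc_over * n_minority)
    n_majority_kept = int(perc_under * n_synthetic)
    return n_majority_kept, n_minority + n_synthetic


majority, minority = expected_smote_counts(n_minority=10, perc_over=5, perc_under=2)
print(majority, minority)  # 100 60
```

With 10 minority rows, perc.over = 5 and perc.under = 2, this gives 100 majority rows and 60 minority rows (10 original plus 50 synthetic), so the 50 NA_character_ rows are the synthetic ones.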


Solution

  • Try converting the y column to a factor first. Some other implementations (like themis::smote()) raise an informative error when the types don't match.

    Walk-through with reticulate, Python from R:

    library(reticulate)
    library(performanceEstimation)
    
    # original:
    rtest <- function(r_df, over_val, under_val) {
      set.seed(0)
      new_df <- smote(y ~ ., r_df, perc.over = over_val, perc.under = under_val,  k = 5)
      table(new_df$y)
      return(new_df)
    }
    
    py_run_string('
    from sklearn.datasets import make_classification
    import pandas as pd
    
    X, y = make_classification(n_classes=2,
        class_sep=2, 
        weights=[0.90, 0.10], 
        n_informative=4, 
        n_redundant=1, 
        flip_y=0,
        n_features=5, 
        n_clusters_per_class=1,
        n_samples=100,
        random_state=10)
    
    df = pd.DataFrame(X, columns = ["x1", "x2", "x3", "x4", "x5"])
    df["y"] = y
    df["y"].value_counts()')
    
    # py$ to access objects in reticulate python environment
    # check initial state
    str(py$df)
    #> 'data.frame':    100 obs. of  6 variables:
    #>  $ x1: num  -0.00637 2.47159 3.32977 2.38089 3.59025 ...
    #>  $ x2: num  1.78 -1.34 -3.3 -1.92 -1.51 ...
    #>  $ x3: num  -1.6937 -0.0247 1.3269 0.0854 0.5175 ...
    #>  $ x4: num  1.8 2.57 1.72 1.97 1.93 ...
    #>  $ x5: num  0.407 -1.455 -2.571 -2.15 -2.427 ...
    #>  $ y : num  0 0 0 0 0 0 0 0 0 0 ...
    #>  - attr(*, "pandas.index")=RangeIndex(start=0, stop=100, step=1)
    table(py$df$y)
    #> 
    #>  0  1 
    #> 90 10
    
    # apply rtest
    new_df <- rtest(py$df, 5, 2)
    # and check results
    str(new_df)
    #> 'data.frame':    160 obs. of  6 variables:
    #>  $ x1: num  2.479 1.694 2.314 2.774 0.626 ...
    #>  $ x2: num  -1.5 -2.82 -1.62 -2.01 -1.54 ...
    #>  $ x3: num  0.65 0.496 -0.714 0.336 -1.025 ...
    #>  $ x4: num  1.7 1.2 2.74 1.38 2.41 ...
    #>  $ x5: num  -1.31 -2.21 -2.38 -2.72 -1.37 ...
    #>  $ y : chr  "0" "0" "0" "0" ...
    #>  - attr(*, "pandas.index")=RangeIndex(start=0, stop=100, step=1)
    table(new_df$y)
    #> 
    #>   0   1 
    #> 100  10
    
    # but there should be 160 observations in total ...
    # let's check the tail
    tail(new_df)
    #>            x1        x2          x3       x4       x5    y
    #> 451 -2.601792 -2.428654 -0.29214031 2.509291 2.282252 <NA>
    #> 461 -2.553342 -2.445119 -0.22325568 2.487501 2.303546 <NA>
    #> 471 -2.334285 -2.270024 -0.12400004 2.349623 2.256293 <NA>
    #> 48  -2.228444 -2.429856  0.08238596 2.431736 2.391834 <NA>
    #> 491 -2.636799 -2.416758 -0.34191319 2.525036 2.266866 <NA>
    #> 50  -2.070363 -2.569577  0.43053504 2.234752 2.470430 <NA>
    
    # so apparently there are NA values,
    # but table() does not include those by default
    table(new_df$y, useNA = "ifany") 
    #> 
    #>    0    1 <NA> 
    #>  100   10   50
    

    Let's modify that function to better match the examples in ?smote, i.e. turn the response into a factor:

    rtest2 <- function(r_df, over_val, under_val) {
      set.seed(0)
      r_df$y <- as.factor(r_df$y)
      smote(y ~ ., r_df, perc.over = over_val, perc.under = under_val,  k = 5)
    }
    
    new_df2 <- rtest2(py$df, 5, 2)
    str(new_df2)
    #> 'data.frame':    160 obs. of  6 variables:
    #>  $ x1: num  2.479 1.694 2.314 2.774 0.626 ...
    #>  $ x2: num  -1.5 -2.82 -1.62 -2.01 -1.54 ...
    #>  $ x3: num  0.65 0.496 -0.714 0.336 -1.025 ...
    #>  $ x4: num  1.7 1.2 2.74 1.38 2.41 ...
    #>  $ x5: num  -1.31 -2.21 -2.38 -2.72 -1.37 ...
    #>  $ y : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
    #>  - attr(*, "pandas.index")=RangeIndex(start=0, stop=100, step=1)
    
    # and let's check our new response distribution:
    table(new_df2$y, useNA = "ifany") 
    #> 
    #>   0   1 
    #> 100  60
    
    # counts from python (`r.` to access R objects):
    py_eval("r.new_df2['y'].value_counts()")
    #> 0    100
    #> 1     60
    #> Name: y, dtype: int64
    

    Created on 2023-09-30 with reprex v2.0.2
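
Since the question calls R from Python through rpy2, the same idea can probably be applied on the Python side as well. The sketch below rests on an assumption that should be verified against your rpy2 version: that the pandas converter turns a categorical column with string categories into an R factor. The names with_factor_response and smote_via_r are hypothetical helpers, and r_test.r is the unchanged file from the question.

```python
import pandas as pd


def with_factor_response(df, target="y"):
    """Return a copy of df whose target column is a string-valued categorical.

    Assumption: rpy2's pandas converter maps such a column to an R factor.
    """
    out = df.copy()
    out[target] = out[target].astype(str).astype("category")
    return out


def smote_via_r(df, r_script_path, over_val, under_val):
    """Hypothetical wrapper: source r_test.r and call rtest() on the prepared frame."""
    # Imported lazily so the helper above can be used without an R installation.
    import rpy2.robjects as ro
    from rpy2.robjects import pandas2ri
    from rpy2.robjects.conversion import localconverter

    ro.r.source(r_script_path)
    with localconverter(ro.default_converter + pandas2ri.converter):
        # Same conversion calls as in the question, applied to the prepared frame.
        r_df = ro.conversion.py2rpy(with_factor_response(df))
        result = ro.r.rtest(r_df, over_val, under_val)
        return ro.conversion.rpy2py(result)
```

Calling smote_via_r(df, path, 5, 2) with the df and path from the question would then take the place of the two localconverter blocks. If the assumption about the converter does not hold in your setup, the as.factor() call inside the R function shown above remains the more direct fix.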