I'm calling an R function from a Python script to apply SMOTE to a dummy dataset. The majority class is 0 (90%) and the minority class is 1 (10%). Calling the R function directly gives the expected output, but calling the same function from Python returns NA_character_ values. Below is the R function:
# file r_test.r
library(performanceEstimation)
rtest <- function(r_df, over_val, under_val) {
set.seed(0)
new_df <- smote(y ~ ., r_df, perc.over = over_val, perc.under = under_val, k = 5)
table(new_df$y)
return(new_df)
}
Below is the Python code that calls this function:
import os
import numpy as np
import pandas as pd
import rpy2.robjects as ro
from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri
from rpy2.robjects.conversion import localconverter
from sklearn.datasets import make_classification
def function2(r_df, over_val, under_val):
    r = ro.r
    r.source(path)
    p = r.rtest(r_df, over_val, under_val)
    return p
path=os.path.join(os.getcwd(), "r_test.r")
X, y = make_classification(n_classes=2,
                           class_sep=2,
                           weights=[0.90, 0.10],
                           n_informative=4,
                           n_redundant=1,
                           flip_y=0,
                           n_features=5,
                           n_clusters_per_class=1,
                           n_samples=100,
                           random_state=10)
df = pd.DataFrame(X, columns = ["x1", "x2", "x3", "x4", "x5"])
df['y'] = y
df['y'].value_counts()
Output -
0 90
1 10
Name: y, dtype: int64
base = importr('base')
with localconverter(ro.default_converter + pandas2ri.converter):
r_from_pd_df = ro.conversion.py2rpy(df)
with localconverter(ro.default_converter + pandas2ri.converter):
pd_from_r_df = ro.conversion.rpy2py(function2(r_from_pd_df, 5, 2))
pd_from_r_df['y'].value_counts()
Output -
0 100
NA_character_ 50
1 10
Name: y, dtype: int64
The number of NA_character_ values is exactly the number of minority-class samples that smote should generate. What mistake am I making in the code above, and how can I get 1s instead of NA_character_? Note: I'm completely new to R. If the problem is in the R code, please point it out with a complete example.
Try converting the y column to a factor first. Some other implementations (like themis::smote()) will give you a nice informative error if the types don't match.
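If you would rather do the conversion on the Python side, a minimal sketch is to make y a categorical column with string categories before handing the DataFrame to rpy2. This assumes that pandas2ri maps such a column to an R factor, which is worth verifying against your rpy2 version. The small DataFrame below is a hypothetical stand-in for the one in the question.

```python
import pandas as pd

# Hypothetical stand-in for the question's DataFrame (only two columns kept).
df = pd.DataFrame({"x1": [0.1, 0.2, 0.3, 0.4], "y": [0, 0, 0, 1]})

# Cast the labels to strings first, then to a categorical column.
# Assumption: pandas2ri converts a string-valued categorical to an R factor.
df["y"] = df["y"].astype(str).astype("category")

print(df["y"].dtype)
print(list(df["y"].cat.categories))
```

The rest of the question's code (the localconverter blocks and the call to function2) would stay unchanged; only the dtype of y differs when the data reaches R.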
Walk-through with reticulate, calling Python from R:
library(reticulate)
library(performanceEstimation)
# original:
rtest <- function(r_df, over_val, under_val) {
set.seed(0)
new_df <- smote(y ~ ., r_df, perc.over = over_val, perc.under = under_val, k = 5)
table(new_df$y)
return(new_df)
}
py_run_string('
from sklearn.datasets import make_classification
import pandas as pd
X, y = make_classification(n_classes=2,
class_sep=2,
weights=[0.90, 0.10],
n_informative=4,
n_redundant=1,
flip_y=0,
n_features=5,
n_clusters_per_class=1,
n_samples=100,
random_state=10)
df = pd.DataFrame(X, columns = ["x1", "x2", "x3", "x4", "x5"])
df["y"] = y
df["y"].value_counts()')
# py$ to access objects in reticulate python environment
# check initial state
str(py$df)
#> 'data.frame': 100 obs. of 6 variables:
#> $ x1: num -0.00637 2.47159 3.32977 2.38089 3.59025 ...
#> $ x2: num 1.78 -1.34 -3.3 -1.92 -1.51 ...
#> $ x3: num -1.6937 -0.0247 1.3269 0.0854 0.5175 ...
#> $ x4: num 1.8 2.57 1.72 1.97 1.93 ...
#> $ x5: num 0.407 -1.455 -2.571 -2.15 -2.427 ...
#> $ y : num 0 0 0 0 0 0 0 0 0 0 ...
#> - attr(*, "pandas.index")=RangeIndex(start=0, stop=100, step=1)
table(py$df$y)
#>
#> 0 1
#> 90 10
# apply rtest
new_df <- rtest(py$df, 5, 2)
# and check results
str(new_df)
#> 'data.frame': 160 obs. of 6 variables:
#> $ x1: num 2.479 1.694 2.314 2.774 0.626 ...
#> $ x2: num -1.5 -2.82 -1.62 -2.01 -1.54 ...
#> $ x3: num 0.65 0.496 -0.714 0.336 -1.025 ...
#> $ x4: num 1.7 1.2 2.74 1.38 2.41 ...
#> $ x5: num -1.31 -2.21 -2.38 -2.72 -1.37 ...
#> $ y : chr "0" "0" "0" "0" ...
#> - attr(*, "pandas.index")=RangeIndex(start=0, stop=100, step=1)
table(new_df$y)
#>
#> 0 1
#> 100 10
# but there should be 160 observations in total ...
# let's check the tail
tail(new_df)
#> x1 x2 x3 x4 x5 y
#> 451 -2.601792 -2.428654 -0.29214031 2.509291 2.282252 <NA>
#> 461 -2.553342 -2.445119 -0.22325568 2.487501 2.303546 <NA>
#> 471 -2.334285 -2.270024 -0.12400004 2.349623 2.256293 <NA>
#> 48 -2.228444 -2.429856 0.08238596 2.431736 2.391834 <NA>
#> 491 -2.636799 -2.416758 -0.34191319 2.525036 2.266866 <NA>
#> 50 -2.070363 -2.569577 0.43053504 2.234752 2.470430 <NA>
# so apparently there are NA values,
# but table() does not include those by default
table(new_df$y, useNA = "ifany")
#>
#> 0 1 <NA>
#> 100 10 50
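The same pitfall exists on the pandas side: value_counts() drops missing values unless told otherwise. The sketch below uses a hypothetical Series with None standing in for the missing labels. In the question's own output the missing entries arrived as NA_character_ objects and were therefore counted as an ordinary value.

```python
import pandas as pd

# Hypothetical stand-in for the SMOTE output: 100 zeros, 10 ones, 50 missing.
y = pd.Series(["0"] * 100 + ["1"] * 10 + [None] * 50, name="y")

default_counts = y.value_counts()           # missing values are dropped
full_counts = y.value_counts(dropna=False)  # missing values get their own row
n_missing = int(y.isna().sum())

print(default_counts.sum(), full_counts.sum(), n_missing)
```

Comparing the two totals against len(y) is a quick way to spot silently dropped labels.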
Let's modify that function to better match the examples in ?smote, i.e. turn the response into a factor:
rtest2 <- function(r_df, over_val, under_val) {
set.seed(0)
r_df$y <- as.factor(r_df$y)
smote(y ~ ., r_df, perc.over = over_val, perc.under = under_val, k = 5)
}
new_df2 <- rtest2(py$df, 5, 2)
str(new_df2)
#> 'data.frame': 160 obs. of 6 variables:
#> $ x1: num 2.479 1.694 2.314 2.774 0.626 ...
#> $ x2: num -1.5 -2.82 -1.62 -2.01 -1.54 ...
#> $ x3: num 0.65 0.496 -0.714 0.336 -1.025 ...
#> $ x4: num 1.7 1.2 2.74 1.38 2.41 ...
#> $ x5: num -1.31 -2.21 -2.38 -2.72 -1.37 ...
#> $ y : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
#> - attr(*, "pandas.index")=RangeIndex(start=0, stop=100, step=1)
# and let's check our new response distribution:
table(new_df2$y, useNA = "ifany")
#>
#> 0 1
#> 100 60
# counts from python (`r.` to access R objects):
py_eval("r.new_df2['y'].value_counts()")
#> 0 100
#> 1 60
#> Name: y, dtype: int64
Created on 2023-09-30 with reprex v2.0.2
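As a sanity check on the final counts, the sketch below reproduces the 100 / 60 split with simple arithmetic. It assumes perc.over and perc.under act as plain multipliers (synthetic cases per minority case, and majority cases kept per synthetic case), which is consistent with the output above. The helper name is hypothetical and not part of any package.

```python
def expected_smote_counts(n_minority, perc_over, perc_under):
    """Hypothetical helper: expected class counts, assuming both
    parameters are plain multipliers as described above."""
    n_synthetic = int(perc_over * n_minority)        # new minority cases
    n_majority_kept = int(perc_under * n_synthetic)  # majority cases sampled
    return {"majority": n_majority_kept, "minority": n_minority + n_synthetic}


counts = expected_smote_counts(n_minority=10, perc_over=5, perc_under=2)
print(counts)  # {'majority': 100, 'minority': 60}
```

The 50 synthetic cases match the 50 NA rows in the first attempt. The 100 majority rows exceed the 90 in the input, which suggests the majority class is sampled with replacement.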