r, svm, scatter-plot

Is there an approach to calculate the overlap between groups in a scatterplot, to judge whether the data can be classified with an SVM model?


To clarify the question, I used several datasets that represent different variants of two-dimensional data.

The datasets can be accessed at: https://drive.google.com/file/d/14-VivVlGSlaJo6BXlYMqn-1leorSU6ET/view?usp=sharing

I also use a helper function:

scatterplot_check <- function(data, dependent_col, x_column, y_column, legend_pos = "topright") {
  # Open a new graphics device (portable alternative to x11())
  dev.new()
  # Treat the class column as a factor so that factor, character, and integer
  # labels all map to colour indices 1..k (this also covers labels such as 0)
  classes <- as.factor(data[[dependent_col]])
  class_colors <- as.integer(classes)
  plot(data[[x_column]], data[[y_column]],
       col = class_colors, pch = 18,
       xlab = x_column, ylab = y_column)
  # One legend entry per class, using the same colour index as the points
  legend(legend_pos, legend = levels(classes),
         col = seq_along(levels(classes)), pch = 18)
}

Suppose I read all the data into the environment with:

dataset1 <- read.csv("dataset1.csv")
dataset2 <- read.csv("dataset2.csv")
dataset3 <- read.csv("dataset3.csv")

Here are some variants of the scatterplot:

scatterplot_check(dataset1, "y","x.1","x.2")

This one is most likely to be classifiable with an SVM model.

scatterplot_check(dataset2, "Purchased","Age","EstimatedSalary")

This one is also likely to be classifiable with an SVM model.

scatterplot_check(dataset3, "grades","english","math")

This one is unlikely to be classifiable with an SVM model.

scatterplot_check(dataset3, "grades","read","math", legend_pos="topleft")

This one is also unlikely to be classifiable with an SVM model.

Is there a recommended approach to compute how likely a 2D scatterplot is to be modeled well with an SVM?


Solution

  • I have put some thought into this. Although it may have weaknesses, this is my custom approach to calculating the overlap between groups in a scatterplot. The steps are:

    1. For each class, calculate the percentage of X and Y values that fall into each interval of a sequence of ranges (bins).
    2. Define a percentage threshold (in my case, 5%).
    3. Keep only the bins above the 5% threshold and compare them across classes. If every X and Y variable occupies the same bins in each class, the data is unlikely to be modeled well with an SVM, since the variables appear independent of the class. On the other hand, if any X or Y variable occupies different bins in different classes, the data is likely to be modeled well with an SVM, since its distribution changes with the class.
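
    The three steps can be sketched in a few lines of base R. This is only a minimal illustration, not the dataset_class_comparison function used for the results in this answer: the names bin_shares and classes_differ, the toy data frame, and the choice to compare only which bins survive the threshold (rather than their percentages) are all assumptions made for this example.

```r
# Hypothetical helper (not the author's function): share of values per bin,
# keeping only the bins whose share exceeds the threshold (steps 1 and 2).
bin_shares <- function(values, breaks, threshold = 0.05) {
  bins <- cut(values, breaks = breaks, include.lowest = TRUE)
  shares <- prop.table(table(bins))
  shares <- setNames(as.numeric(shares), names(shares))
  shares[shares > threshold]
}

# Hypothetical helper: TRUE when at least one variable occupies a different
# set of bins in at least one class (step 3).
classes_differ <- function(data, class_col, x_col, y_col,
                           x_breaks, y_breaks, threshold = 0.05) {
  groups <- split(data, data[[class_col]])
  differs <- function(col, breaks) {
    kept <- lapply(groups, function(g) {
      names(bin_shares(g[[col]], breaks, threshold))
    })
    length(unique(kept)) > 1
  }
  differs(x_col, x_breaks) || differs(y_col, y_breaks)
}

# Toy data (assumed values): the classes are separated along x but not along y.
toy <- data.frame(
  class = rep(c("a", "b"), each = 4),
  x = c(0.5, 0.6, 1.5, 1.6, 2.5, 2.6, 3.5, 3.6),
  y = c(0.5, 1.5, 0.5, 1.5, 0.5, 1.5, 0.5, 1.5)
)

# x occupies different bins per class, so this reports a difference
print(classes_differ(toy, "class", "x", "y", x_breaks = 0:4, y_breaks = 0:4))
```

    Comparing only the set of surviving bins keeps the check simple, but it ignores how much the percentages differ inside shared bins, so two classes with the same bins and very different shares would still count as overlapping.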

    Here are the results when I applied it to the 4 cases:

    d1_compare <- dataset_class_comparison(dataset1, "y", "x.1", "x.2")
    ============================================================================
    Class = -1
    SeqX(-10,10,1)
    SeqY(-10,10,1)
    x.1_-2 to -1 (pct)  x.1_-1 to 0 (pct)   x.1_0 to 1 (pct)   x.1_1 to 2 (pct) 
                  0.16               0.38               0.30               0.10 
    x.2_-2 to -1 (pct)  x.2_-1 to 0 (pct)   x.2_0 to 1 (pct)   x.2_1 to 2 (pct) 
                  0.14               0.28               0.46               0.08 
    ============================================================================
    ============================================================================
    Class = 1
    SeqX(-10,10,1)
    SeqY(-10,10,1)
    x.1_-1 to 0 (pct)  x.1_1 to 2 (pct)  x.1_2 to 3 (pct)  x.1_3 to 4 (pct) 
                 0.08              0.42              0.36              0.08 
    x.2_-1 to 0 (pct)  x.2_0 to 1 (pct)  x.2_1 to 2 (pct)  x.2_2 to 3 (pct)  x.2_3 to 4 (pct) 
                 0.06              0.26              0.38              0.20              0.06 
    ============================================================================
    Conclusion: Since each class within a 5% threshold not having similiar distribution from x.1 or x.2
    SVM Likely can be modeled
    
    
    d2_compare <- dataset_class_comparison(dataset2, "Purchased", "Age", "EstimatedSalary")
    ============================================================================
    Class = 0
    SeqX(10,100,10)
    SeqY(10000,1e+06,10000)
    Age_10 to 20 (pct) Age_20 to 30 (pct) Age_30 to 40 (pct) Age_40 to 50 (pct) 
                 0.066              0.325              0.413              0.178 
    EstimatedSalary_10000 to 20000 (pct) EstimatedSalary_20000 to 30000 (pct) EstimatedSalary_30000 to 40000 (pct) 
                                   0.063                                0.077                                0.059 
    EstimatedSalary_40000 to 50000 (pct) EstimatedSalary_50000 to 60000 (pct) EstimatedSalary_60000 to 70000 (pct) 
                                   0.098                                0.182                                0.112 
    EstimatedSalary_70000 to 80000 (pct) EstimatedSalary_80000 to 90000 (pct) 
                                   0.210                                0.150 
    ============================================================================
    ============================================================================
    Class = 1
    SeqX(10,100,10)
    SeqY(10000,1e+06,10000)
    Age_30 to 40 (pct) Age_40 to 50 (pct) Age_50 to 60 (pct) 
                 0.222              0.392              0.304 
      EstimatedSalary_20000 to 30000 (pct)   EstimatedSalary_30000 to 40000 (pct)   EstimatedSalary_40000 to 50000 (pct) 
                                     0.123                                  0.105                                  0.056 
      EstimatedSalary_70000 to 80000 (pct)   EstimatedSalary_80000 to 90000 (pct)   EstimatedSalary_90000 to 1e+05 (pct) 
                                     0.080                                  0.080                                  0.074 
     EstimatedSalary_1e+05 to 110000 (pct) EstimatedSalary_110000 to 120000 (pct) EstimatedSalary_120000 to 130000 (pct) 
                                     0.093                                  0.062                                  0.062 
    EstimatedSalary_130000 to 140000 (pct) EstimatedSalary_140000 to 150000 (pct) 
                                     0.093                                  0.099 
    ============================================================================
    Conclusion: Since each class within a 5% threshold not having similiar distribution from Age or EstimatedSalary
    SVM Likely can be modeled
    
    
    d3_compare <- dataset_class_comparison(dataset3, "grades", "english", "math")
    ============================================================================
    Class = KK-08
    SeqX(0,100,10)
    SeqY(100,1000,100)
     english_0 to 10 (pct) english_10 to 20 (pct) english_20 to 30 (pct) english_30 to 40 (pct) english_40 to 50 (pct) 
                     0.571                  0.162                  0.061                  0.084                  0.056 
    math_600 to 700 (pct) 
                    0.989 
    ============================================================================
    ============================================================================
    Class = KK-06
    SeqX(0,100,10)
    SeqY(100,1000,100)
     english_0 to 10 (pct) english_10 to 20 (pct) english_20 to 30 (pct) english_30 to 40 (pct) english_40 to 50 (pct) 
                     0.377                  0.262                  0.098                  0.131                  0.066 
    math_600 to 700 (pct) 
                    0.984 
    ============================================================================
    Conclusion: Since each class within a 5% threshold having similiar distribution either from english and math
    SVM Unlikely can be modeled
    
    
    
    d4_compare <- dataset_class_comparison(dataset3, "grades", "math", "read")
     ============================================================================
    Class = KK-08
    SeqX(100,1000,100)
    SeqY(100,1000,100)
    math_600 to 700 (pct) 
                    0.989 
    read_600 to 700 (pct) 
                    0.992 
    ============================================================================
    ============================================================================
    Class = KK-06
    SeqX(100,1000,100)
    SeqY(100,1000,100)
    math_600 to 700 (pct) 
                    0.984 
    read_600 to 700 (pct) 
                        1 
    ============================================================================
    Conclusion: Since each class within a 5% threshold having similiar distribution either from math and read
    SVM Unlikely can be modeled
    

    dataset_class_comparison is a custom function of over 300 lines, which can be found at https://drive.google.com/file/d/1RmIhbNnKZWS2jFIsS9p4LWjhcbikpOga/view?usp=sharing