r  logistic-regression  smote

Why do I get 'Error in T[, col] <- data[, col]' when I use SMOTE in R?


I have a large dataset of forest fire occurrences, and I want to predict when a fire ignites. Ignition is very rare: 290 times out of 620 000.

A tibble: 62,905 x 13
  amplitude polarity DEM_avg    DC   DMC   DSR  FFMC    Pd    RH  TEMP    WS tree_cover fire
      <dbl>    <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>      <dbl> <fct>
1     -37.8        0    165.  269.  21.9 0.607  84.0     0  65.1  290.  4.36          8 0
2     -68.1        0    303.  168.  44.5  1.41  89.9     0  46.6  296. 0.692       34.7 0
3     -54.3        0    332.  168.  44.5  1.41  89.9     0  46.6  296. 0.692       35.8 1
4     -108.        0    338.  168.  44.5  1.41  89.9     0  46.6  296. 0.692       30.3 0
5     -60.3        0    374.  171.  35.7  2.30  88.9   0.3  51.7  295.  4.01       29.6 1
6     -82.8        0    48.2  133.  18.4 0.210  84.9     0  65.1  289.  1.35       18.7 0
7     -99.6        0    299.  219.  42.6  2.09  90.8     0  34.2  297.  1.42          7 1
8     -98.1        0    116.  153.  44.7 0.988  89.0     0  41.3  298. 0.235       32.6 0

I tried to use SMOTE to balance my highly imbalanced dataset with the changes suggested by StupidWolf. I do the following:

library(readr)
library(tidyverse)
library(caret)
library(DMwR)
data <- read_csv("data/fire2018.csv", 
    col_types = cols(fire = col_factor(levels = c("0", 
        "1"))))
training.samples <- data$fire %>% createDataPartition(p = 0.8, list = FALSE)
train.data  <- data[training.samples, ]
test.data <- data[-training.samples, ]
SMOTE(fire ~ amplitude + polarity_dummy + DEM_avg + DC + DMC + DSR + FFMC + Pd + RH + T + VPD + WS + tree_cover, data = data.frame(train.data), perc.over = 600, perc.under = 100)

However, when I use SMOTE from the DMwR package I now get the following error:

Error in factor(newCases[, a], levels = 1:nlevels(data[, a]), labels = levels(data[,  : 
  invalid 'labels'; length 0 should be 1 or 2
In addition: Warning messages:
1: In if (class(data[, col]) %in% c("factor", "character")) { :
  the condition has length > 1 and only the first element will be used
2: In smote.exs(data[minExs, ], ncol(data), perc.over, k) :
  NAs introduced by coercion
3: In smote.exs(data[minExs, ], ncol(data), perc.over, k) :
  NAs introduced by coercion

I have looked at several suggested solutions. One suggested converting the variables to numeric and factor, but mine already have the right types: the dependent variable is a factor with 2 levels, the independent variables are numeric, and there are no NAs in any of my variables. So that did not help in my case; I got a similar error.
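The two claims above (column types and missing values) can be checked across every column of the data frame, including columns that never appear in the formula. The sketch below is only an illustration: `train_df` is a small made-up stand-in for `train.data`, and the `station` column is a hypothetical unused variable, not one from the real dataset.

```r
# Hypothetical stand-in for train.data; the real data comes from the CSV file.
train_df <- data.frame(
  amplitude = c(-37.8, -68.1, -54.3, -108),
  station   = c("a", "b", NA, "d"),  # assumed unused character column with an NA
  fire      = factor(c("0", "0", "1", "0"), levels = c("0", "1")),
  stringsAsFactors = FALSE
)

# First class of every column: anything other than numeric or factor is suspect.
col_classes <- sapply(train_df, function(x) class(x)[1])

# Number of missing values in every column.
na_counts <- colSums(is.na(train_df))

print(col_classes)
print(na_counts)
```

A column that shows up here as `character`, or with a non-zero NA count, is worth a closer look even if the formula never mentions it.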


Solution

  • After spending hours on this problem, I finally arrived at the following solution with help from StupidWolf. I had to clean up my dataset, which included many variables that I did not use, and some of those contained NAs. Apparently, SMOTE could not handle that, even though I was not using those variables anyway. To sum up, I also changed the data argument of the SMOTE function to a data.frame. My code ended up like this:

    library(readr)
    library(tidyverse)
    library(caret)
    library(DMwR)
    data <- read_csv("data/test.csv",
                     col_types = cols(fire = col_factor(levels = c("0", "1"))))
    training.samples <- data$fire %>% createDataPartition(p = 0.8, list = FALSE)
    train.data  <- data[training.samples, ]
    test.data <- data[-training.samples, ]
    newData <- SMOTE(fire ~ amplitude + polarity_dummy + DEM_avg + DC + DMC + DSR + FFMC + Pd + RH + T + VPD + WS + tree_cover, data = data.frame(train.data), perc.over = 10000, perc.under = 1000)
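The clean-up step described in this answer can be sketched in base R as follows. This is an illustration under assumptions, not the exact steps that were used: `train_df`, the shortened formula, and the `station` column are made up for the example, and the `SMOTE` call is left as a comment because it needs the DMwR package and the real data.

```r
# Hypothetical stand-in for train.data with one unused column (station).
train_df <- data.frame(
  amplitude  = c(-37.8, -68.1, -54.3, -108, -60.3),
  tree_cover = c(8, 34.7, NA, 30.3, 29.6),
  station    = c("a", NA, "c", "d", "e"),
  fire       = factor(c("0", "0", "1", "0", "1"), levels = c("0", "1")),
  stringsAsFactors = FALSE
)

# Shortened, assumed formula; the real one lists many more predictors.
form <- fire ~ amplitude + tree_cover

# Names used by the formula: c("fire", "amplitude", "tree_cover").
model_cols <- all.vars(form)

# Keep only those columns and make sure the result is a plain data.frame.
clean_df <- as.data.frame(train_df[, model_cols])

# Drop rows that still contain an NA in any remaining column.
clean_df <- clean_df[complete.cases(clean_df), ]

# With DMwR installed, the balanced data could then be requested like this:
# newData <- SMOTE(form, data = clean_df, perc.over = 600, perc.under = 100)
```

Selecting the columns first means that a missing value in an unused variable no longer removes an otherwise complete row.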