r, missing-data, imputation

Missing values, classification task


I am using the breast cancer dataset from UCI, but it contains missing values. Can anyone help me fix this? I am new to ML and don't know much about techniques for handling missing values. Here is the link for the dataset: cancerdata.

I tried this code in R:

data <- read.csv('D:/cancer.csv', header=FALSE)  # Reading the data 

for(i in 1:ncol(data)) {
    data[is.na(data[,i]), i] <- mean(data[,i], na.rm=TRUE)
}

but it gives me an error (sorry, it may be trivial, but I am really new to this).

Thank you for your time and consideration.

Here is the output I have:


Solution

  • Try the missForest package in R: https://cran.r-project.org/web/packages/missForest/missForest.pdf

    It is easy to use, fast, and does a good job of imputing both categorical and numeric values.

    For a quick tutorial, see here: https://www.analyticsvidhya.com/blog/2016/03/tutorial-powerful-packages-imputing-missing-values/

    Edit: You have a total of 16 missing values in the data, all in column 7 (V7). You can check this with:

    data <- read.csv('D:/cancer.csv', header=FALSE)  # Reading the data
    sum(data == "?")
    sum(data$V7 == "?")
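
    As an alternative to comparing against "?" after the fact, the marker can be turned into NA while the file is read. The sketch below uses a few made-up rows in the same 11-column layout (not the real file) so that it runs on its own; `na.strings` is a standard `read.csv` argument, and the names `rows`, `toy` and `na_per_column` are only for illustration.

    ```r
    # Toy rows in the same 11-column layout as the file (values are made up).
    rows <- c(
      "1000025,5,1,1,1,2,1,3,1,1,2",
      "1002945,5,4,4,5,7,10,3,2,1,2",
      "1057013,8,4,5,1,2,?,7,3,1,4",
      "1096800,6,6,6,9,6,?,7,8,1,2"
    )

    # na.strings = "?" converts the marker to NA at read time,
    # so V7 is parsed as a numeric column instead of text.
    toy <- read.csv(text = rows, header = FALSE, na.strings = "?")

    na_per_column <- colSums(is.na(toy))  # number of NAs in each column
    print(na_per_column)
    print(class(toy$V7))
    ```

    With the real file, the same call would be `read.csv('D:/cancer.csv', header=FALSE, na.strings="?")`.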
    

    Now, missForest will impute all the NAs in data, no matter where they are. If you want to retain some NAs, separate that data first.
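
    One possible way to separate the data, sketched on made-up values: split off the columns whose NAs should stay, impute the rest, then bind everything back together. `impute_fn` is a hypothetical stand-in (a simple median fill) so the example runs without extra packages; in practice it would be swapped for something like `function(d) missForest(d)$ximp`. The column names are also invented.

    ```r
    # Made-up data: NAs in `a` should be imputed, NAs in `keep_me` should stay.
    df <- data.frame(
      a = c(1, NA, 3, 4),
      b = c(2, 4, 6, 8),
      keep_me = c(NA, 10, 20, NA)
    )

    # Hypothetical stand-in for the imputer: fills NAs with the column median.
    impute_fn <- function(d) {
      d[] <- lapply(d, function(col) {
        col[is.na(col)] <- median(col, na.rm = TRUE)
        col
      })
      d
    }

    keep_cols <- "keep_me"  # columns to leave untouched
    to_impute <- df[, setdiff(names(df), keep_cols), drop = FALSE]
    untouched <- df[, keep_cols, drop = FALSE]

    # Impute only the selected columns, then restore the original column order.
    result <- cbind(impute_fn(to_impute), untouched)[, names(df)]
    print(result)
    ```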

    To impute all NAs:

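    data[data == "?"] <- NA
    # The "?" entries caused V7 to be read as text; convert it to numeric before imputing
    data$V7 <- as.numeric(as.character(data$V7))
    library(missForest)
    data <- missForest(data)$ximp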
    

    Now all the NAs have been imputed and replaced with some meaningful values. To verify this:

    sum(is.na(data))
    

    Use this data with imputed values.
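
    For comparison, the mean-imputation loop from the question can also be made to work. A likely reason it did not behave as expected is that the "?" entries make V7 non-numeric, so `mean()` cannot be computed on it and `is.na()` finds nothing to replace. A minimal sketch on made-up rows, assuming the "?" marker is read as NA:

    ```r
    # Made-up rows in the same 11-column layout; "?" marks a missing value.
    rows <- c(
      "1000025,5,1,1,1,2,1,3,1,1,2",
      "1002945,5,4,4,5,7,10,3,2,1,2",
      "1057013,8,4,5,1,2,?,7,3,1,4"
    )
    toy <- read.csv(text = rows, header = FALSE, na.strings = "?")

    # Replace NAs in every numeric column with that column's mean.
    for (i in seq_len(ncol(toy))) {
      if (is.numeric(toy[[i]])) {
        toy[is.na(toy[[i]]), i] <- mean(toy[[i]], na.rm = TRUE)
      }
    }
    print(toy$V7)
    ```

    Mean imputation fills each column independently of the others, whereas missForest predicts the missing entries from the remaining columns.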