I am using the breast cancer dataset (breastcancer) from UCI, but it contains missing values. Can anyone help me fix this? I am new to ML and don't know much about techniques for handling missing values. Here is the link to the dataset: cancerdata.
I tried this code in R:
data <- read.csv('D:/cancer.csv', header=FALSE) # Reading the data
for(i in 1:ncol(data)) {
data[is.na(data[,i]), i] <- mean(data[,i], na.rm=TRUE)
}
but it gives me an error. Sorry, this may be trivial, but I am really new to this.
Thank you for your time and consideration.
Try the missForest package in R: https://cran.r-project.org/web/packages/missForest/missForest.pdf
It is really easy to use, fast and does a great job imputing categorical and numeric values.
For a quick tutorial, see here: https://www.analyticsvidhya.com/blog/2016/03/tutorial-powerful-packages-imputing-missing-values/
Edit:
You have a total of 16 missing values in the data, all in column 7 (V7). You can check this with:
data <- read.csv('D:/cancer.csv', header=FALSE) # Reading the data
sum(data == "?")
sum(data$V7 == "?")
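If you want the count broken down by column, something like the following may help. This is only a sketch: `toy` is a small made-up data frame standing in for cancer.csv, and the `na.strings = "?"` variant is an alternative way of reading the file that was not part of the code above.

```r
# Toy stand-in for the real file (values are made up for illustration)
csv_text <- "1,5,1\n2,?,3\n3,4,?\n4,?,2"
toy <- read.csv(text = csv_text, header = FALSE)

# Count "?" entries in each column
q_counts <- colSums(toy == "?")
print(q_counts)

# Alternative: let read.csv convert "?" to NA while reading,
# which also keeps the affected columns numeric
toy_na <- read.csv(text = csv_text, header = FALSE, na.strings = "?")
na_counts <- colSums(is.na(toy_na))
print(na_counts)
```

With the real file you would pass the file path instead of `text = csv_text`.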
Note that missForest will impute all NAs in the data, no matter where they are. If you want to retain some NAs, separate that data first.
To impute all NAs:
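data[data == "?"] <- NA  # mark the "?" placeholders as missing
# V7 was read as text because of "?"; convert it so missForest gets a numeric column
data$V7 <- as.numeric(as.character(data$V7))
library(missForest)
data <- missForest(data)$ximp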
Now all the NAs have been imputed and replaced with some meaningful values. To verify this:
sum(is.na(data))
Use this data with imputed values.
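As for the mean-imputation loop in the question: it most likely trips on column 7 because the "?" entries make that column non-numeric, so `mean()` has nothing numeric to work with. If plain mean imputation is all you need, a minimal sketch could look like this, assuming the "?" markers are read in as NA. The `toy` data frame is made up for illustration and stands in for cancer.csv.

```r
# Toy stand-in for the real file; "?" is read as NA so columns stay numeric
toy <- read.csv(text = "1,5,1\n2,?,3\n3,4,?\n4,?,2",
                header = FALSE, na.strings = "?")

# Replace the NAs in each column with that column's mean
for (i in seq_len(ncol(toy))) {
  col_mean <- mean(toy[, i], na.rm = TRUE)
  toy[is.na(toy[, i]), i] <- col_mean
}

# Count remaining NAs to confirm nothing is left missing
remaining_na <- sum(is.na(toy))
print(remaining_na)
```

Mean imputation ignores relationships between columns, which is the main reason to prefer something like missForest when the extra dependency is acceptable.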