I have a data frame with around 300,000 rows and 600 columns. I want to remove all rows that contain the string "NULL" in at least one of the columns. How can I achieve this without using for loops over the rows and columns?
Will data.table work faster than a normal data frame in such a case?
We can use `rowSums` (should be fast):

df1[!rowSums(df1 == 'NULL'), ]
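To see why this works: `df1 == 'NULL'` yields a logical matrix, `rowSums` counts the `TRUE`s per row, and `!` maps any nonzero count to `FALSE`, dropping that row. A minimal sketch (`m` and its values are made up for illustration):

```r
# toy data frame; the name 'm' is hypothetical
m <- data.frame(a = c('x', 'NULL', 'y'),
                b = c('p', 'q',    'NULL'),
                stringsAsFactors = FALSE)

rowSums(m == 'NULL')        # per-row counts of 'NULL': 0, 1, 1
m[!rowSums(m == 'NULL'), ]  # keeps only the first row
```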
Or another option is `any` with `apply`:

df1[!apply(df1 == 'NULL', 1, any), ]
If we need a `data.table` solution:

setDT(df1)[df1[, !Reduce(`+`, lapply(.SD, `==`, 'NULL'))]]
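Here `lapply(.SD, `==`, 'NULL')` compares each column to `'NULL'`, and `Reduce(`+`, ...)` sums those logical vectors element-wise, giving the same per-row count as `rowSums`. A small illustration (`dt` and its values are made up, not the data below):

```r
library(data.table)

dt <- data.table(a = c('x', 'NULL'), b = c('NULL', 'y'))

# per-column logical vectors, summed element-wise across columns
counts <- dt[, Reduce(`+`, lapply(.SD, `==`, 'NULL'))]
counts      # 1 1: each row contains one 'NULL'
dt[!counts] # zero rows survive, since every row had a 'NULL'
```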
If we read the "NULL" elements as `NA` by using `na.strings = 'NULL'` in `read.table`/`read.csv` (as mentioned by @Roland):

setDT(df1)[df1[, !Reduce(`+`, lapply(.SD, is.na))]]
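Once the "NULL" values have become `NA`, base R's `complete.cases` gives the same row filter as the `is.na` reduction above. A sketch (the `df1` values here are made up; the commented `read.csv` call with its file name is hypothetical):

```r
# hypothetical read; na.strings turns the literal string "NULL" into NA
# df1 <- read.csv('file.csv', na.strings = 'NULL', stringsAsFactors = FALSE)

df1 <- data.frame(V1 = c('A', NA, 'B'),
                  V2 = c('C', 'D', NA),
                  stringsAsFactors = FALSE)

df1[complete.cases(df1), ]  # keeps only rows with no NA
```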
The example data used:

set.seed(24)
df1 <- data.frame(V1 = sample(c(LETTERS[1:3], 'NULL'), 20, replace = TRUE),
                  V2 = sample(c(LETTERS[1:5], 'NULL'), 20, replace = TRUE),
                  V3 = sample(c(LETTERS[1:8], 'NULL'), 20, replace = TRUE),
                  stringsAsFactors = FALSE)
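With this data the two base-R approaches can be cross-checked against each other (the data frame is rebuilt here so the snippet runs on its own):

```r
set.seed(24)
df1 <- data.frame(V1 = sample(c(LETTERS[1:3], 'NULL'), 20, replace = TRUE),
                  V2 = sample(c(LETTERS[1:5], 'NULL'), 20, replace = TRUE),
                  V3 = sample(c(LETTERS[1:8], 'NULL'), 20, replace = TRUE),
                  stringsAsFactors = FALSE)

r1 <- df1[!rowSums(df1 == 'NULL'), ]
r2 <- df1[!apply(df1 == 'NULL', 1, any), ]

identical(r1, r2)  # TRUE: both keep exactly the 'NULL'-free rows
any(r1 == 'NULL')  # FALSE: no 'NULL' survives
```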