I removed certain rows from my data frame using the following code:
df2 <- df1[!(df1$variable==1), ]
This was a dummy variable, and the rows that had the value 1 for it were successfully removed. (I checked the dimensions of my data frame with the dim function before and after, and everything seemed normal.)
However, after I ran my regression model with the new data set df2, I saw that the degrees of freedom had fallen sharply, by far more than the number of removed rows!
I wondered how this could happen. Then I realized that the new data set had many rows containing nothing but NAs. Wherever the dummy variable had a missing value, R had produced a full row of NA values.
After realizing that the above code was not the best way to delete rows, I tried the following:
df2 <- df1[(df1$variable==0 | is.na(df1$variable)), ]
It seems to have worked, since I no longer have the same problem. But could this new code have similar or other problems that I am not aware of?
The new code should be fine. The problem with the old code was caused by a combination of the NAs in df1$variable and the == comparison operator. If you read the help on comparison operators, ?"==", you will see:
"Missing values (NA) and NaN values are regarded as non-comparable even to themselves, so comparisons involving them will always result in NA."
In your case, whenever df1$variable was NA, the result of the comparison was NA (not TRUE or FALSE), and indexing a data frame with an NA selector turns every variable in that row into NA. For example:
# Toy data: every combination of 0, 1, and NA for two variables
df1 <- expand.grid(variable=c(0, 1, NA), var2=c(0, 1, NA))

# Old approach: NA == 1 yields NA, so sel1 contains NAs
sel1 <- !(df1$variable==1)
sel1
df1[sel1, ]  # rows where sel1 is NA come back as all-NA rows

# New approach: is.na() handles the missing values explicitly
sel2 <- df1$variable==0 | is.na(df1$variable)
sel2
df1[sel2, ]  # no spurious all-NA rows
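As a side note (an alternative idiom, not something from your question): %in% also avoids the NA problem, because match() treats NA as simply not matching, so the comparison never returns NA:

df2 <- df1[!(df1$variable %in% 1), ]  # NA %in% 1 is FALSE, so the NA rows are kept

Note that this keeps every row whose value is not 1, including NAs, whereas your version keeps only 0s and NAs; for a dummy variable the two are equivalent. Your is.na() version has the advantage of making the handling of missing values explicit to a reader.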