I would like to create a boxplot that contains some missing values in Julia. Here is some reproducible code:
using DataFrames
using StatsPlots
df = DataFrame(y = [1,2,3,2,1,2,4,NaN,NaN,2,1])
boxplot(df[!, "y"])
Output:
ArgumentError: quantiles are undefined in presence of NaNs or missing values
I know that the error happens because of the NaN values, but is there not an option in boxplot to still plot the values instead of removing the missing values beforehand? I would assume that it might be designed in a way that it works in presence of missing values. In R it will still plot the boxplot, so I was wondering why in Julia you must remove these missing values and what is an appropriate way to do this?
so I was wondering why in Julia you must remove these missing values
The general reason is a difference in design philosophy between R and Julia.
R was designed to be maximally convenient, at the risk of sometimes doing the wrong thing. It tries to guess what you most likely want and does it. In this case, you most likely want NaN values to be ignored.
Julia is designed for safety and production use. If you have NaN in your data, it means that the data preparation process had a serious issue (such as dividing 0 by 0). In production scenarios you want your code to error in such cases, as otherwise it is hard to identify the root cause of the issue.
Now, seconding what Dan Getz commented: most likely your NaN is actually missing (as you refer to it as missing). These two should not be mixed, as they have significantly different interpretations. NaN is a value that is undefined or unrepresentable, especially in floating-point arithmetic (e.g. 0 divided by 0), while missing is a value that is absent (e.g. we have not collected a measurement).
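To make the distinction concrete, here is a minimal sketch using only Base Julia. The variable names (nan_value, readings) are made up for illustration and do not come from the question.

```julia
# NaN is an ordinary Float64 produced by an undefined floating-point operation.
nan_value = 0.0 / 0.0

# missing is a separate singleton value that marks an absent observation.
readings = [1.5, missing, 2.5]   # hypothetical measurements, one not collected

println(isnan(nan_value))         # NaN is detected with isnan
println(ismissing(readings[2]))   # missing is detected with ismissing
println(typeof(nan_value))        # a plain floating-point number
println(eltype(readings))         # the element type records that values may be missing

# Both propagate through arithmetic, each in its own way.
println(nan_value + 1)
println(missing + 1)
```

Note that a vector containing NaN keeps the element type Float64, so nothing in the type tells you that something went wrong, whereas a vector containing missing advertises that fact in its element type.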
Still, even if your data contained missing, you would get an error for the same safety reason.
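The error text in the question matches the one raised by quantile from the Statistics standard library, so the behaviour can be checked without plotting anything. The sketch below assumes that this is the call that fails inside boxplot; quantile_error is a hypothetical helper written for this example.

```julia
using Statistics

# Hypothetical helper: return the exception thrown by quantile, or nothing.
function quantile_error(values)
    try
        quantile(values, 0.5)
        return nothing
    catch err
        return err
    end
end

err_nan = quantile_error([1.0, 2.0, NaN])
err_missing = quantile_error([1.0, 2.0, missing])

# Both inputs are rejected rather than silently skipped.
println(typeof(err_nan))
println(typeof(err_missing))
```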
what is an appropriate way to do this?
NaNs are very rare in practice, so what Dan Getz recommended is a typical way to filter them. Another option would be [x for x in df.y if !isnan(x)].
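As a sketch of the filtering step, the comprehension above and filter(!isnan, ...) should give the same result. A plain vector y with the values from the question stands in for df.y here so that the example runs without DataFrames; the plotting call is left as a comment because it needs StatsPlots.

```julia
using Statistics

# Same values as df.y in the question; the literal is promoted to Vector{Float64}.
y = [1, 2, 3, 2, 1, 2, 4, NaN, NaN, 2, 1]

# Option 1: the comprehension from the answer.
clean_comprehension = [x for x in y if !isnan(x)]

# Option 2: filter with a negated predicate (an alternative, not from the answer).
clean_filter = filter(!isnan, y)

println(length(y), " values before, ", length(clean_filter), " after")

# With the NaNs gone, quantiles (and therefore a boxplot) are well defined.
println(quantile(clean_filter, [0.25, 0.5, 0.75]))

# boxplot(clean_filter)   # assumes `using StatsPlots` has been run
```

Both options allocate a new vector and leave the original data untouched, which keeps the NaNs available for tracking down where they came from.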
If you had missing values in your data (as this is most likely what you want), you should write boxplot(skipmissing(df.y)).
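A minimal sketch of the skipmissing approach, again with a plain vector y standing in for df.y and the plotting calls left as comments since they need StatsPlots. The collect step is an optional extra, shown in case a plain vector is more convenient than the lazy iterator.

```julia
using Statistics

# The data from the question, with the two bad entries recorded as missing.
y = [1, 2, 3, 2, 1, 2, 4, missing, missing, 2, 1]

# skipmissing returns a lazy iterator over the non-missing entries; y is not copied.
present = skipmissing(y)
println(median(present))

# collect materializes the iterator into a vector without Missing in its element type.
clean = collect(present)
println(eltype(clean), ", ", length(clean), " values")

# boxplot(skipmissing(y))   # as in the answer; assumes `using StatsPlots`
# boxplot(clean)            # same data as a plain vector
```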