julia nan boxplot missing-data

ArgumentError: quantiles are undefined in presence of NaNs or missing values


I would like to create a boxplot that contains some missing values in Julia. Here is some reproducible code:

using DataFrames
using StatsPlots
df = DataFrame(y = [1,2,3,2,1,2,4,NaN,NaN,2,1])

boxplot(df[!, "y"])

Output:

ArgumentError: quantiles are undefined in presence of NaNs or missing values

I know that the error happens because of the NaN values, but is there no option in boxplot to plot the remaining values without removing the missing ones beforehand? I assumed it would be designed to work in the presence of missing values. In R the boxplot is still drawn, so I was wondering why in Julia you must remove these missing values. What is an appropriate way to do this?


Solution

  • so I was wondering why in Julia you must remove these missing values

    The general reason is a difference in design philosophy between R and Julia. R was designed to be maximally convenient, at the risk of sometimes doing the wrong thing. It tries to guess what you most likely want and does that. In this case, you most likely want NaN values to be ignored.

    Julia is designed for safety and production use. If you have NaN in your data, it usually means that the data preparation process had a serious issue (such as dividing 0 by 0). In production scenarios you want your code to error in such cases, because otherwise it is hard to identify the root cause of the issue.

    Now, seconding what Dan Getz commented: most likely your NaN is actually missing (since you refer to it as missing). The two should not be mixed, because they have significantly different interpretations. NaN is a value that is undefined or unrepresentable, especially in floating-point arithmetic (e.g. 0 divided by 0), while missing is a value that is absent (e.g. a measurement that was never collected).
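
To make the distinction concrete, here is a minimal sketch using only Base Julia. The variable names are illustrative and not taken from the question.

```julia
# NaN is an ordinary Float64 produced by an undefined floating-point operation.
x = 0.0 / 0.0

# missing is a separate singleton value (of type Missing) for absent data.
m = missing

println(typeof(x), " ", isnan(x))       # the NaN lives inside Float64
println(typeof(m), " ", ismissing(m))   # missing has its own type

# The element type of a vector shows which of the two it can hold.
with_nan = [1.0, NaN]          # stays a plain Float64 vector
with_missing = [1.0, missing]  # element type widens to a Union with Missing
println(eltype(with_nan))
println(eltype(with_missing))
```

Because NaN hides inside Float64, nothing in the column type tells you it is there, whereas missing shows up in the element type.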

    Still, even if your data contained missing, you would get an error for the same safety reason.
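
The error text refers to quantiles, so the behavior can be reproduced without any plotting package. The sketch below assumes the check comes from quantile in the Statistics standard library. quantile_fails is a hypothetical helper written only for this illustration.

```julia
using Statistics  # standard library; provides quantile

# Hypothetical helper: true if computing the 0.5 quantile throws an ArgumentError.
function quantile_fails(v)
    try
        quantile(v, 0.5)
        return false
    catch err
        return err isa ArgumentError
    end
end

with_nan = [1.0, 2.0, NaN]
with_missing = [1.0, 2.0, missing]

println(quantile_fails(with_nan))      # NaN input is rejected
println(quantile_fails(with_missing))  # missing input is rejected as well

# Explicitly skipping the missing entry makes the computation well defined.
println(quantile(skipmissing(with_missing), 0.5))
```

In both cases the caller has to decide explicitly what to do with the problematic entries, which is the safety behavior described above.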

    what is an appropriate way to do this?

    NaNs are very rare in practice, so what Dan Getz recommended is a typical way to filter them. Another would be [x for x in df.y if !isnan(x)].
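
As a sketch of NaN filtering on the data from the question, the comprehension above and Base's filter with a negated predicate give the same result. A plain vector y stands in for df.y here, so that the example runs without DataFrames.

```julia
using Statistics  # standard library; provides median

# Same values as in the question; the NaN literals make this a Float64 vector.
y = [1, 2, 3, 2, 1, 2, 4, NaN, NaN, 2, 1]

# Option 1: comprehension with a guard.
clean_comprehension = [x for x in y if !isnan(x)]

# Option 2: filter with a negated predicate function.
clean_filter = filter(!isnan, y)

println(clean_comprehension == clean_filter)
println(median(clean_filter))

# With StatsPlots loaded, the cleaned vector can then be plotted:
# boxplot(clean_filter)
```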

    If you had missing values in your data (which is most likely what you want), you should write boxplot(skipmissing(df.y)).
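
If the NaN entries really stand for absent measurements, one possible approach is to recode them as missing first and skip them at plotting time. This is a minimal sketch with a plain vector in place of df.y. The collect call is an added precaution in case a plotting function does not accept the lazy skipmissing iterator directly.

```julia
# Same values as in the question, with NaN used where a value is absent.
y = [1, 2, 3, 2, 1, 2, 4, NaN, NaN, 2, 1]

# Recode NaN as missing; the element type widens to include Missing.
y_missing = [isnan(x) ? missing : x for x in y]

# skipmissing is lazy; collect materializes the observed values into a vector.
y_observed = collect(skipmissing(y_missing))

println(count(ismissing, y_missing))  # number of recoded entries
println(y_observed)

# With StatsPlots loaded, the observed values can then be plotted:
# boxplot(y_observed)
```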