I have written a custom function ord_table()
to extract summary statistics from a series of databases. To get those summary statistics, I have to filter out missing data codes (all codes are large negative numbers).
> data
var1 var2 var3 var4
1 3 2 2 1
2 -99999 3 1 1
3 3 5 -99999 2
4 2 3 3 3
5 2 3 4 4
6 -99999 1 -99999 5
7 3 4 1 5
8 5 1 3 3
9 4 1 5 -99999
10 -99999 5 4 1
11 -99999 3 2 2
12 1 2 5 5
The column names are in a vector ord_vars
.
I am applying the function with lapply
.
ordinal <- lapply(ord_vars, ord_table)
Here is my function:
ord_table <- function(x) {
table_ord <- data.frame(Median=double(length=1),
Minimum=double(length=1),
Maximum=double(length=1),
IQR=double(length=1))
table_ord$Median <- data %>% subset(x > 0) %>% pull(x) %>% median()
table_ord$Minimum <- data %>% subset(x > 0) %>% pull(x) %>% min()
table_ord$Maximum <- data %>% subset(x > 0) %>% pull(x) %>% max()
table_ord$IQR <- data %>% subset(x > 0) %>% pull(x) %>% IQR()
return(table_ord)
}
It is all working except for the subsetting to remove the negative values that code for missing data.
I have tried filter()
and subset()
, and both with unquote(x)
instead of x
.
For each dataframe, I get results including missing data codes:
[[1]]
Median Minimum Maximum IQR
1 2 -99999 5 100002.2
Where I want to get:
[[1]]
Median Minimum Maximum IQR
1 2.5 2 5 2.25
set.seed(123) ## for sake of reproducibility
a <- c(1, 2, 3, 4, 5, -99999)
var1 <- sample(a, 12, replace=TRUE)
var2 <- sample(a, 12, replace=TRUE)
var3 <- sample(a, 12, replace=TRUE)
var4 <- sample(a, 12, replace=TRUE)
data <- cbind(var1, var2, var3, var4) %>% as.data.frame()
ord_vars <- data %>% colnames() %>% as.vector()
If you use filter
, you can create a symbol from you string, then defuse with {{
, so:
ord_table <- function (x) {
table_ord <- data.frame(Median = double(length = 1),
Minimum = double(length = 1),
Maximum = double(length = 1),
IQR = double(length = 1)
)
x <- sym(x)
table_ord$Median <- data %>% filter({{x}} > 0) %>% pull(x) %>% median()
table_ord$Minimum <- data %>% filter({{x}} > 0) %>% pull(x) %>% min()
table_ord$Maximum <- data %>% filter({{x}} > 0) %>% pull(x) %>% max()
table_ord$IQR <- data %>% filter({{x}} > 0) %>% pull(x) %>% IQR()
return(table_ord)
}
But I would probably do something like this, and actually set the values as NA
(ideally you would do this when you read in the data):
library(tidyverse)
data |>
mutate(across(test_vars, \(v) na_if(v, -99999))) |>
pivot_longer(cols = test_vars) |>
drop_na() |>
summarise(
mean = mean(value), min = min(value), max = max(value), IQR = IQR(value),
.by = name
)
# A tibble: 4 × 5 name mean min max IQR <chr> <dbl> <dbl> <dbl> <dbl> 1 var3 3.2 1 5 2 2 var4 2.58 1 5 1.5 3 var2 2.44 1 5 2 4 var1 3.33 1 5 1