
Accessing a column by name in a data.frame: return NA if the column doesn't exist


Is there a more efficient way to check whether a column exists by its name, returning the column if it exists and NA if it doesn't?

Right now I'm using the following function:

TryGetColumn <- function(x, column.name, column.names, value.if.not.exists) {
    if (column.name %in% column.names) {
        x[, column.name]
    } else {
        value.if.not.exists
    }
}

df <- data.frame(a = 1:5, b = 6:10)
col.names <- colnames(df)
ab <- TryGetColumn(df, "a", col.names, NA) + TryGetColumn(df, "b", col.names, NA)
ac <- TryGetColumn(df, "a", col.names, NA) + TryGetColumn(df, "c", col.names, NA)

ab
#[1]  7  9 11 13 15
ac
#[1] NA NA NA NA NA

EDIT: Based on Gregor's answer, I rewrote the code as follows:

col.names <- c("a", "b", "c")
col.matches <- as.list(col.names %in% colnames(df))
names(col.matches) <- col.names

TryGetColumn <- function(col.name) if (col.matches[[col.name]]) df[, col.name] else NA

ab <- TryGetColumn("a") + TryGetColumn("b")
ac <- TryGetColumn("a") + TryGetColumn("c")

Now it is based on an associative array (a named list), which should be faster than a linear lookup on every TryGetColumn call.


Solution

  • I'd use match.

    match(x = c("mpg", "disp", "blarg"), table = names(mtcars), nomatch = NA)
    # [1]  1  3 NA
    

    If you write a wrapper, you don't need to pass both a data frame and its columns separately:

    column_index <- function(data, column.names, value.if.not.exists = NA) {
        match(x = column.names, table = names(data), nomatch = value.if.not.exists)
    }
    

    Edits

    Oops, I thought you wanted the column indices. You want to return a single, whole column. For that, your function looks good; I'd just simplify it a little:

    TryGetColumn <- function(x, column.name, value.if.not.exists = NA) {
        if (column.name %in% names(x)) {
            return(x[, column.name])
        } 
        return(value.if.not.exists)
    }
    

    If you're using this a lot and are really worried about computing time, it would be faster to check a bunch of column names at once (as at the top of my answer), then pull only the columns that exist out of the data frame and create as many NA columns as you need. But if you're concerned about computing time, I'd wager a lot that this particular step isn't what's slowing you down, whatever you're doing.
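
The "check a bunch of names at once" idea could be sketched roughly like this. The helper name get_columns_or_na is hypothetical (it is not part of the original answer), and the sketch assumes it is acceptable to get the result back as a data frame in which each missing column is filled with NA:

```r
# Hypothetical helper (an illustration, not the answerer's code):
# look up all requested names in a single match() call, then build a
# data frame in which missing columns are filled with NA.
get_columns_or_na <- function(data, column.names) {
    # One vectorised lookup; idx is NA wherever a name is not a column of data.
    idx <- match(column.names, names(data))

    # Pull existing columns by position, or create an all-NA column.
    out <- lapply(idx, function(i) {
        if (is.na(i)) rep(NA, nrow(data)) else data[[i]]
    })
    names(out) <- column.names
    as.data.frame(out, stringsAsFactors = FALSE)
}

df <- data.frame(a = 1:5, b = 6:10)
cols <- get_columns_or_na(df, c("a", "b", "c"))

ab <- cols$a + cols$b
ac <- cols$a + cols$c
```

Here ab and ac come out the same as in the question (the sums for ab, all NA for ac), but the column-name lookup happens once rather than on every call.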