R, mapping items in a data frame


Total newb here. Please explain how on Earth this line works; I understand the rest:

 gene_symbol <- id2symbol$gene_symbol[id2symbol$Ensembl == gene_id]

How does the ==, which as I know equals TRUE, work in this case? Or does it mean something else here? Thank you ever so much!

cancer_genes <- c("ENSG00000139618", "ENSG00000106462", "ENSG00000116288")

id2symbol <- data.frame(
  "Ensembl" = c("ENSG00000141510", "ENSG00000139618", "ENSG00000106462", "ENSG00000116288"),
  "gene_symbol" = c("TP53", "BRCA2", "EZH2", "PARK7")
)

gene_id_converter <- function(gene_id) {
  gene_symbol <- id2symbol$gene_symbol[id2symbol$Ensembl == gene_id]
  return(gene_symbol)
}

gene_id_converter(gene_id="ENSG00000141510")

Solution

  • The function handles one ID at a time, so to convert several IDs we can either wrap it with Vectorize or loop over the elements, e.g. with sapply:

    sapply(cancer_genes, gene_id_converter)
    

    -output

    ENSG00000139618 ENSG00000106462 ENSG00000116288 
            "BRCA2"          "EZH2"         "PARK7" 
    

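    == is an elementwise operator: it compares the lhs and rhs position by position, so the two sides should either be of the same length, or one of them (here the rhs) should be of length 1, in which case it is recycled to the length of the other. The output of == is a logical vector of TRUE/FALSE values, one per element, which is then used to subset the corresponding values from id2symbol$gene_symbol.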
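
    To see the two steps separately, here is a minimal sketch for a single ID. The names gene_id and mask are only illustrative (they are not needed in the real function), and stringsAsFactors = FALSE is set explicitly on the assumption that the columns should stay character on R versions before 4.0 as well.

    ```r
    id2symbol <- data.frame(
      Ensembl = c("ENSG00000141510", "ENSG00000139618", "ENSG00000106462", "ENSG00000116288"),
      gene_symbol = c("TP53", "BRCA2", "EZH2", "PARK7"),
      stringsAsFactors = FALSE
    )

    gene_id <- "ENSG00000139618"

    # Step 1: compare every Ensembl ID with the single gene_id (recycled to length 4)
    mask <- id2symbol$Ensembl == gene_id
    print(mask)         # FALSE  TRUE FALSE FALSE

    # Position(s) where the comparison is TRUE
    print(which(mask))  # 2

    # Step 2: keep only the gene_symbol values at the TRUE positions
    gene_symbol <- id2symbol$gene_symbol[mask]
    print(gene_symbol)  # "BRCA2"
    ```

    So == here is not a single TRUE; it returns one TRUE/FALSE per row, and [ keeps the elements where it is TRUE.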

    Thus, if we pass more than one element to the function, the two sides differ in length and recycling can give unexpected results:

    > id2symbol$Ensembl == cancer_genes[1]
    [1] FALSE  TRUE FALSE FALSE
    > id2symbol$Ensembl == cancer_genes
    [1] FALSE FALSE FALSE FALSE
    Warning message:
    In id2symbol$Ensembl == cancer_genes :
      longer object length is not a multiple of shorter object length
    

    Thus, by looping over cancer_genes, each call compares a single element, which is recycled against the whole Ensembl column and gives back a logical vector of TRUE/FALSE; the function then returns the id2symbol$gene_symbol values at the positions that are TRUE.
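
    For completeness, a sketch of two ways to avoid writing the loop yourself: wrapping the function with Vectorize, and a lookup with match(). The names via_vectorize and via_match are only illustrative, and match() is an alternative I am suggesting, not something from the original code.

    ```r
    cancer_genes <- c("ENSG00000139618", "ENSG00000106462", "ENSG00000116288")

    id2symbol <- data.frame(
      Ensembl = c("ENSG00000141510", "ENSG00000139618", "ENSG00000106462", "ENSG00000116288"),
      gene_symbol = c("TP53", "BRCA2", "EZH2", "PARK7"),
      stringsAsFactors = FALSE
    )

    gene_id_converter <- function(gene_id) {
      id2symbol$gene_symbol[id2symbol$Ensembl == gene_id]
    }

    # Option 1: Vectorize() wraps the function so it is applied to each ID in turn
    via_vectorize <- Vectorize(gene_id_converter)(cancer_genes)
    print(via_vectorize)

    # Option 2: match() gives, for each ID, the position of its first match
    # in id2symbol$Ensembl (NA if the ID is absent); use that to index gene_symbol
    via_match <- id2symbol$gene_symbol[match(cancer_genes, id2symbol$Ensembl)]
    print(via_match)
    ```

    Both return the symbols in the order of cancer_genes. One difference to keep in mind: for an ID that is not in the table, match() gives NA, while gene_id_converter returns an empty vector.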