Getting Values of Specific Elements of a data frame in R


I have some very simple code, and I do not understand why it is not working the way I want. Basically, I have a data frame and want to capture the value of the n'th element of a column in the data frame and store it in a vector. Here is my code:

COL1_VALUES <- c("ABC","XYZ","PQR")
COL2_VALUES <- c("DEF","JKL","TSM")

means <- data.frame(COL1_VALUES,COL2_VALUES)

for (i in 1:nrow(means)) {
    COL1_VALUES[i] <- means$COL1[i];
    COL2_VALUES[i] <- means$COL2[i];
}

print(means$COL1)
print(COL1_VALUES)

This outputs:

[1] ABC XYZ PQR
Levels: ABC PQR XYZ
[1] "1" "3" "2"

Why am I not getting ABC XYZ PQR in the vector COL1_VALUES? It appears that 1, 3, 2 are the indices of ABC, XYZ, PQR in means$COL1. What do I need to do to get ABC XYZ PQR in the vector COL1_VALUES?

Thanks.


Solution

  • In R versions before 4.0.0, the data.frame() function ships with a default setting of stringsAsFactors=TRUE (since R 4.0.0 the default is FALSE). With the old default, all input character vectors are implicitly converted into so-called "factors" when creating a data.frame.

    A factor is somewhat like a vector of integers plus text labels that describe those integers. For example, if a column gender has type factor, it is actually a vector of integers with 1s and 2s plus an attached dictionary saying that category id 1 means Male and category id 2 means Female, or vice versa.
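
    To make the integers-plus-labels idea concrete, here is a minimal sketch using the first column from the question. The names x, f and out are only illustrative. It also reproduces the "1" "3" "2" from the question: assigning factor elements into a character vector copies the integer codes, not the labels.

    ```r
    # The question's first column as a plain character vector.
    x <- c("ABC", "XYZ", "PQR")
    f <- factor(x)

    levels(f)        # the labels, sorted: "ABC" "PQR" "XYZ"
    as.integer(f)    # the integer codes pointing into levels(f): 1 3 2
    as.character(f)  # the labels looked up again: "ABC" "XYZ" "PQR"

    # Element-wise assignment into a character vector, as in the question's loop,
    # stores the codes as text instead of the labels.
    out <- character(length(f))
    for (i in seq_along(f)) {
      out[i] <- f[i]
    }
    out              # "1" "3" "2"
    ```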

    This default setting on stringsAsFactors is a sneaky beast and can show up in numerous unexpected locations. In most of these cases, it helps just to add an explicit stringsAsFactors=FALSE option so as to keep character vectors as character vectors.

    Below I list the functions that I personally struggled with until realising that all I was missing was the stringsAsFactors=FALSE option:

    • data.frame
    • read.csv, read.table and other read.* functions
    • expand.grid
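
    As a rough illustration of two functions from this list, the sketch below passes the option explicitly to expand.grid and read.csv. The column names and the inline CSV text are made up for the example.

    ```r
    # expand.grid converts character inputs to factors unless told otherwise.
    g_default <- expand.grid(letter = c("a", "b"), n = 1:2)
    g_plain   <- expand.grid(letter = c("a", "b"), n = 1:2,
                             stringsAsFactors = FALSE)
    class(g_default$letter)  # "factor"
    class(g_plain$letter)    # "character"

    # read.csv accepts the same option; text= reads from a string instead of a file.
    csv_text <- "name,score\nABC,1\nXYZ,2"
    d <- read.csv(text = csv_text, stringsAsFactors = FALSE)
    class(d$name)            # "character"
    ```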

    In your specific example above, what you need to do is find this line:

    means <- data.frame(COL1_VALUES,COL2_VALUES)
    

    and replace it with:

    means <- data.frame(COL1_VALUES,COL2_VALUES,
                         stringsAsFactors=FALSE)
    

    such that you are explicitly requesting data.frame() not to do any implicit conversions behind your back.
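
    A short check of the fix, plus one possible alternative for when the data frame already holds factor columns: convert them with as.character(). In this sketch stringsAsFactors = TRUE is passed explicitly so that it behaves the same whatever the default is, and the names means_factor and recovered are only illustrative.

    ```r
    COL1_VALUES <- c("ABC", "XYZ", "PQR")
    COL2_VALUES <- c("DEF", "JKL", "TSM")

    # With the option set, the columns stay character vectors.
    means <- data.frame(COL1_VALUES, COL2_VALUES, stringsAsFactors = FALSE)
    class(means$COL1_VALUES)  # "character"

    # If the columns are already factors, as.character() recovers the labels.
    # No loop is needed: the conversion works on the whole column at once.
    means_factor <- data.frame(COL1_VALUES, COL2_VALUES, stringsAsFactors = TRUE)
    recovered <- as.character(means_factor$COL1_VALUES)
    recovered                 # "ABC" "XYZ" "PQR"
    ```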

    You can also avoid this conversion by changing the global option at the beginning of each R session:

    options(stringsAsFactors = FALSE)
    

    Note, however, that modifying this global option only affects your own R session, so code that relies on it may behave differently on other people's machines.

    This answer contains more information about how to disable it permanently.