Tags: r, numeric, categorical-data, r-factor

What is the R equivalent of Python's .cat.codes, which converts a categorical variable to integer codes?


In Python (pandas), you can generate the categorical codes for a variable using .cat.codes, e.g.

df['col3'] = df['col3'].astype('category').cat.codes

How do you do this in R?


Solution

  • Fleshing this out a bit further for @Sid29:

    The pandas accessor .cat.codes extracts the integer codes behind a categorical variable, which is what R calls a factor. The equivalent in R is:

    a <- factor(c("good", "bad", "good", "bad", "terrible"))
    
    as.numeric(a)
    [1] 2 1 2 1 3
    

    Note that .cat.codes is zero-based and represents NA (or NaN, which pandas treats the same way) as -1, whereas R's codes start at 1 and the solution above preserves NA, so the output for a missing value is simply NA.
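
    If the pandas convention is needed on the R side (zero-based codes, -1 for missing values), one possible sketch follows. The helper name cat_codes is hypothetical and not part of base R or any package; it assumes the default sorted level order that factor() produces.

    ```r
    # Hypothetical helper: mimic pandas' .cat.codes (0-based, -1 for missing).
    cat_codes <- function(x) {
      codes <- as.integer(factor(x)) - 1L  # R codes are 1-based; shift to 0-based
      codes[is.na(codes)] <- -1L           # pandas reports missing values as -1
      codes
    }

    a <- c("good", "bad", NA, "bad", "terrible")
    as.integer(factor(a))  # R codes: 2 1 NA 1 3
    cat_codes(a)           # pandas-style codes: 1 0 -1 0 2
    ```

    Subtracting 1L before replacing NA keeps the result an integer vector, which matches the integer dtype pandas returns.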

    Edit: as.numeric(a) is better. There was some discussion about using the labels function inside as.numeric. See the warning in ?factor:

    In particular, as.numeric applied to a factor is meaningless, and may happen by implicit coercion. To transform a factor f to approximately its original numeric values, as.numeric(levels(f))[f] is recommended and slightly more efficient than as.numeric(as.character(f)).

    There are some anomalies associated with factors that have NA as a level. It is suggested to use them sparingly, e.g., only for tabulation purposes.

    If you have an NA value, it will coerce all values to NA, hence the reason for using labels. Interestingly, c(a) works (see @42's answer below), although since R 4.1.0 c() applied to a factor returns a factor rather than its integer codes.
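
    The distinction drawn by the ?factor warning is easiest to see with a factor whose labels look like numbers. Below is a small sketch using only base R; the vectors are made up for illustration.

    ```r
    # A factor whose labels are numeric strings.
    f <- factor(c("10", "20", "10", "5"))

    as.integer(f)                # integer codes: positions in levels(f), not the values
    as.numeric(levels(f))[f]     # original values: 10 20 10 5
    as.numeric(as.character(f))  # same values, the slightly slower route

    # With non-numeric labels the same idiom cannot recover numbers,
    # so every element becomes NA (the coercion warning is silenced here).
    g <- factor(c("good", "bad", NA))
    suppressWarnings(as.numeric(levels(g))[g])  # NA NA NA
    as.integer(g)                               # codes survive: 2 1 NA
    ```

    So as.integer() (or as.numeric()) on the factor itself is the right call when the codes are wanted, while as.numeric(levels(f))[f] is for recovering numbers that were stored as labels.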