How to create some of the exotic types?


I am intrigued by ?typeof, which mentions values that can be returned. Is there a way to call typeof(something) and get one of the following?

"promise", "char", "...", "any", "bytecode"

I discovered I can get two of the more exotic types that the help for typeof considers "unlikely to be seen at the user level", like so:

typeof(new("externalptr"))
# [1] "externalptr"

typeof(rlang::new_weakref(new("externalptr")))
# [1] "weakref"

but is there a way to get the others?


Solution

  • Before we try to get specific responses out of typeof, let's clarify what the function actually does. This requires a refresher on what a type is in R.

    Types

    Every R object is represented by a structure in the underlying C code called an SEXP, which contains a pointer to the actual data. Since there are different types of data structure that the SEXP could point to, each SEXP has a field called SEXPTYPE that tells R what sort of structure the SEXP is pointing at. This SEXPTYPE is stored as an integer.

    When we call typeof in R, the integer value of the object's SEXPTYPE is looked up in the type table, which ultimately returns a string to the console giving a human-readable description of the object's SEXPTYPE. The type table therefore contains all possible outputs of typeof.

    In this sense, the type of an object in R is the lowest-level description of what sort of object it is.
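
    As a minimal sketch of what "lowest-level" means here, the two everyday base R objects below (chosen for illustration, they are not part of the original question) report one thing from class and another from typeof:

    ```r
    # A factor prints as labels, but is stored as integer codes
    f <- factor(c("low", "high", "low"))
    class(f)
    #> [1] "factor"
    typeof(f)
    #> [1] "integer"

    # A data frame is a list with a class attribute on top
    df <- data.frame(a = 1:2)
    class(df)
    #> [1] "data.frame"
    typeof(df)
    #> [1] "list"
    ```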

    The SEXPTYPE values with entries in the type table are as follows:

    value SEXPTYPE Description typeof output
    0 NILSXP NULL "NULL"
    1 SYMSXP symbols "symbol"
    2 LISTSXP pairlists "pairlist"
    3 CLOSXP closures "closure"
    4 ENVSXP environments "environment"
    5 PROMSXP promises "promise"
    6 LANGSXP language objects "language"
    7 SPECIALSXP special functions "special"
    8 BUILTINSXP builtin functions "builtin"
    9 CHARSXP internal character strings "char"
    10 LGLSXP logical vectors "logical"
    13 INTSXP integer vectors "integer"
    14 REALSXP numeric vectors "double"
    15 CPLXSXP complex vectors "complex"
    16 STRSXP character vectors "character"
    17 DOTSXP dot-dot-dot object "..."
    18 ANYSXP make “any” args work "any"
    19 VECSXP list (generic vector) "list"
    20 EXPRSXP expression vector "expression"
    21 BCODESXP byte code "bytecode"
    22 EXTPTRSXP external pointer "externalptr"
    23 WEAKREFSXP weak reference "weakref"
    24 RAWSXP raw vector "raw"
    25 S4SXP S4 classes not of simple type "S4"

    It is possible to get an object of each of these types in the console, but as far as I can tell, three of them cannot be obtained in base R alone. These are "char", "any" and "weakref". For these we need to use extra compiled code - either our own little snippets in Rcpp, or already-available functions in rlang.

    Let's get an example of each valid type in the console.

    0: NILSXP

    This is just NULL

    n <- NULL
    typeof(n)
    #> [1] "NULL"
    

    1: SYMSXP

    This is an unevaluated symbol. We can get a symbol in several ways in base R, including quote, substitute, bquote and str2lang

    s <- quote(x)
    typeof(s)
    #> [1] "symbol"
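
    The other routes mentioned above give the same object. A short sketch (as.name is an extra base R function not listed above, added here for comparison):

    ```r
    # Four ways of producing the same unevaluated symbol `x`
    s1 <- quote(x)
    s2 <- as.name("x")
    s3 <- str2lang("x")
    s4 <- bquote(x)

    identical(s1, s2) && identical(s1, s3) && identical(s1, s4)
    #> [1] TRUE

    typeof(s3)
    #> [1] "symbol"
    ```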
    

    2: LISTSXP

    Despite the name, this is not used for list objects, but rather for dotted pairlists, as used in the formals of functions. Functionally, they are similar to standard lists, but are implemented differently in the underlying C code, and do have some important differences

    p <- pairlist(a = 1)
    typeof(p)
    #> [1] "pairlist"
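
    To see a pairlist that R creates for itself, one can look at the formals of a function. This is only a sketch; the function f below is made up for illustration:

    ```r
    f <- function(a = 1, b) NULL

    # formals() returns the argument list, which is stored as a pairlist
    fm <- formals(f)
    typeof(fm)
    #> [1] "pairlist"
    names(fm)
    #> [1] "a" "b"

    # as.list() converts it to an ordinary list (VECSXP)
    typeof(as.list(fm))
    #> [1] "list"
    ```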
    

    3: CLOSXP

    This is used to store closures, i.e. functions that are written in R code rather than being internal C functions.

    f <- function() {}
    typeof(f)
    #> [1] "closure"
    

    4: ENVSXP

    Used to store environments

    e <- new.env()
    typeof(e)
    #> [1] "environment"
    

    5: PROMSXP

    In R, a promise is made of two objects: a chunk of unevaluated code, plus a pointer to an environment in which that code should be evaluated. This is very similar to a quosure in the tidyverse ecosystem, except that one can assign and pass around a quosure quite easily, delaying evaluation until it is required. A promise is more evanescent: it will evaluate as soon as you assign it to a symbol, so to see one in the wild you need to have it contained in a list or assigned to a variable via delayedAssign.

    Creating one in base R is tricky, but it can be achieved by hijacking the complex assignment mechanism. This is where one creates a function like `foo<-` <- function(x, value). The interpreter will allow you to call this function as foo(x) <- value, but in doing so converts value to a promise in place of an unevaluated code chunk. This allows us to capture the promise using match.call():

    `f<-` <- function(x, value) {
      list(match.call()$value)
    }
    
    x <- 1
    f(x) <- "foo"
    
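    # delayedAssign() creates the binding for p itself; it returns NULL,
    # so its result must not be assigned back to p
    delayedAssign("p", x[[1]])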
    
    p
    #> <promise: 0x000002472411f410>
    
    typeof(p)
    #> [1] "promise"
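
    The delayed evaluation that a promise provides can be observed directly with delayedAssign. The sketch below uses a made-up counter variable to show that the code runs once, and only when the value is first needed:

    ```r
    counter <- 0

    # Bind `lazy` to a promise; the code in braces has not run yet
    delayedAssign("lazy", {counter <- counter + 1; "done"})
    counter
    #> [1] 0

    # The first access forces the promise
    lazy
    #> [1] "done"
    counter
    #> [1] 1

    # Later accesses reuse the stored value without re-running the code
    lazy
    #> [1] "done"
    counter
    #> [1] 1
    ```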
    

    6: LANGSXP

    This is just an unevaluated chunk of code (though it has been parsed as syntactically correct before being stored). Again, this can be created by quote or substitute, but formulas are also stored as language objects:

    l <- hello ~ world
    typeof(l)
    #> [1] "language"
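
    A language object created with quote can be inspected and modified much like a list before it is evaluated. A brief sketch, with an arbitrary example call:

    ```r
    cl <- quote(mean(x, na.rm = TRUE))
    typeof(cl)
    #> [1] "language"

    # The first element is the function name, stored as a symbol
    length(cl)
    #> [1] 3
    typeof(cl[[1]])
    #> [1] "symbol"

    # Nothing has been evaluated yet; x is only looked up by eval()
    eval(cl, list(x = c(1, 2, NA)))
    #> [1] 1.5

    # The call can be edited before evaluation
    cl[[1]] <- as.name("median")
    deparse(cl)
    #> [1] "median(x, na.rm = TRUE)"
    ```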
    

    7: SPECIALSXP

    This is only used for primitive functions which pass their arguments unevaluated to the internal R machinery:

    i <- `if`
    typeof(i)
    #> [1] "special"
    

    8: BUILTINSXP

    Again, this is only used to store the built-in functions, but these differ from "special" functions, in that their arguments are evaluated in R before being passed to the internal code.

    b <- `+`
    typeof(b)
    #> [1] "builtin"
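
    The difference in argument handling between "special" and "builtin" functions can be seen from the console. This is a small sketch using quote and sum as representatives of each kind:

    ```r
    typeof(quote)
    #> [1] "special"
    typeof(sum)
    #> [1] "builtin"

    # A special receives its argument unevaluated, so stop() never runs
    kept <- quote(stop("never evaluated"))
    typeof(kept)
    #> [1] "language"

    # A builtin has its arguments evaluated first, so stop() is triggered
    res <- try(sum(stop("evaluated before sum sees it")), silent = TRUE)
    inherits(res, "try-error")
    #> [1] TRUE

    # Both kinds are primitives; functions written in R are not
    c(is.primitive(quote), is.primitive(sum), is.primitive(mean))
    #> [1]  TRUE  TRUE FALSE
    ```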
    

    9: CHARSXP

    These are not used to store R's familiar character vectors, but instead are a character type used internally by R to store atomic character strings. This allows a cache of reusable strings, and allows character vectors (type STRSXP) to be more efficient. Note that R does not like dealing with CHARSXP outside of its internal functions. It will give warnings when you have one in the console, telling you that this type of object cannot have attributes.

    Counter-intuitively, this is one of the hardest to make. Perhaps the easiest way is to create a RAWSXP then change the underlying type in compiled code.

    Rcpp::cppFunction("SEXP mkchar(SEXP s) {SET_TYPEOF(s, 9); return s;}")
    
    get_char <- function(){
      mkchar(as.raw(c(0x48, 0x65, 0x6c, 0x6c, 0x6f, 0x20, 0x57, 0x6f, 0x72, 
               0x6c, 0x64, 0x21, 0x00)))
    }
    
    chr <- get_char()
    
    chr
    #> <CHARSXP: "Hello World!">
    
    typeof(chr)
    #> [1] "char"
    

    10: LGLSXP

    These are the commonly-used logical vectors in R.

    l <- TRUE
    typeof(l)
    #> [1] "logical"
    

    If you're wondering about SEXPTYPE 11 and 12, these were previously used decades ago for factors and ordered factors, but are not even defined any more, so there are no hidden legacy types we can pull out of typeof corresponding to these.


    13: INTSXP

    R uses different types for integers and double-precision floating point numbers, but will convert from integers to doubles with the slightest provocation. The difference between integers and doubles is to some extent abstracted away from the end-user by the fact that INTSXP and REALSXP are lumped together as the mode "numeric"

    I <- 1L
    typeof(I)
    #> [1] "integer"
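
    A few arithmetic operations illustrate how readily integers are converted to doubles, and how mode hides the distinction:

    ```r
    # Integer arithmetic stays integer ...
    typeof(1L + 1L)
    #> [1] "integer"

    # ... until a double is involved, or the result of division is needed
    typeof(1L + 1)
    #> [1] "double"
    typeof(1L / 1L)
    #> [1] "double"

    # mode() reports both as "numeric"
    c(mode(1L), mode(1))
    #> [1] "numeric" "numeric"
    ```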
    

    14: REALSXP

    These are the familiar numeric vectors

    r <- 1.1
    typeof(r)
    #> [1] "double"
    

    15: CPLXSXP

    Complex numbers have their own storage type and are easy to create. I'm not sure why they need their own storage type, as it seems they could just as easily be implemented in S3. Presumably this is partly historical and partly due to efficiency in interfacing with various math libraries.

    C <- 1 + 1i
    typeof(C)
    #> [1] "complex"
    

    16: STRSXP

    This is the familiar character vector.

    s <- "Hello world"
    typeof(s)
    #> [1] "character"
    

    17: DOTSXP

    The dots in some function formals that allow for extra arbitrary arguments to be passed are implemented as a pairlist of promises. This has its own storage mode called DOTSXP

    Perhaps surprisingly, this can actually be obtained without any compiled code:

    d <- (function(...) get("..."))(a = 1)
    
    d
    #> <...>
    
    typeof(d)
    #> [1] "..."
    

    18: ANYSXP

    This isn't really a well-defined storage mode. As far as I can tell, it is used as a stand-in internally, predominantly for the implementation of S4 objects. The console will give you an error if you try to display an object of type "any", but it can be stored and its type reported correctly. I can't see a way to obtain it without just coercing an existing object via compiled code:

    Rcpp::cppFunction("SEXP get_any(SEXP s) {SET_TYPEOF(s, 18); return s;}")
    
    a <- get_any(1:5)
    
    a
    #> Error: unimplemented type 'any' in 'PrintValueRec'
    
    typeof(a)
    #> [1] "any"
    

    19: VECSXP

    This is the familiar all-purpose R list

    l <- list()
    typeof(l)
    #> [1] "list"
    

    20: EXPRSXP

    Used for (lists of) unevaluated expressions

    e <- expression(hello * world)
    typeof(e)
    #> [1] "expression"
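
    Expression vectors are also what parse returns. In this sketch (the code string is arbitrary), each element of the vector is itself a language object that can be evaluated separately:

    ```r
    e2 <- parse(text = "a + b; f(x)")
    typeof(e2)
    #> [1] "expression"
    length(e2)
    #> [1] 2

    # Each element is an unevaluated call
    typeof(e2[[1]])
    #> [1] "language"
    eval(e2[[1]], list(a = 1, b = 2))
    #> [1] 3
    ```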
    

    21: BCODESXP

    Used for byte code of compiled functions.

    b <- .Internal(bodyCode(mean))
    typeof(b)
    #> [1] "bytecode"
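
    The same approach should work for your own functions once they have been byte-compiled. A sketch, assuming the compiler package that ships with R is available (add_one is a made-up function):

    ```r
    add_one <- function(x) x + 1

    # Explicitly byte-compile the closure
    compiled <- compiler::cmpfun(add_one)

    # body() still returns the original expression ...
    typeof(body(compiled))
    #> [1] "language"

    # ... whereas the internal bodyCode gives the byte code object
    bc <- .Internal(bodyCode(compiled))
    typeof(bc)
    #> [1] "bytecode"

    compiled(1)
    #> [1] 2
    ```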
    

    22: EXTPTRSXP

    This was already mentioned in the question and is here for completeness

    e <- new("externalptr")
    typeof(e)
    #> [1] "externalptr"
    

    23: WEAKREFSXP

    This was already mentioned in the question and is here for completeness

    w <- rlang::new_weakref(.GlobalEnv)
    typeof(w)
    #> [1] "weakref"
    

    24: RAWSXP

    This is just an array of unsigned 8-bit integers

    r <- as.raw(1L)
    typeof(r)
    #> [1] "raw"
    

    25: S4SXP

    This is used for objects made in the native object-oriented S4 system

    setClass("R_obj", slots = c(a = "character", b = "numeric"))
    s <- new("R_obj", a = "Hello world", b = 1)
    typeof(s)
    #> [1] "S4"
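
    The "not of simple type" qualifier in the table matters: an S4 object that extends a basic type keeps that basic type. A sketch with a hypothetical class called Celsius:

    ```r
    # An S4 class that contains (extends) the basic numeric type
    setClass("Celsius", contains = "numeric")
    temp <- new("Celsius", c(21.5, 23))

    isS4(temp)
    #> [1] TRUE

    # The type is that of the underlying data, not "S4"
    typeof(temp)
    #> [1] "double"
    ```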
    

    In addition to these 24 types, there are 3 other SEXPTYPEs defined which do not have names in the type lookup table and therefore can't return unique names from typeof. These are 30 (NEWSXP), 31 (FREESXP) and 99 (FUNSXP). The first two are used internally for memory management / garbage collection, and should only ever exist for microseconds. The third is used as a placeholder SEXPTYPE for lumping together closures / builtin functions / special functions when searching for objects of mode function. As far as I can tell, no SEXP ever actually has this SEXPTYPE.

    I'd be interested to hear from anyone who has a way of creating a WEAKREFSXP without using rlang / Rcpp. It would also be good to hear about any ways of creating a CHARSXP or ANYSXP without using compiled code (though these seem to be a bit unstable when used in the console, however they are produced).


    As a final note, the closely related concepts of mode, storage mode and class come up when talking about types. Both mode and storage.mode are essentially aliases for type, as described here:

    Storage mode

    The call storage.mode(x) simply calls typeof(x) and returns it, unless typeof(x) is "closure", "builtin" or "special", then storage.mode returns "function". It is therefore just a slight abstraction / simplification of type.
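
    A quick check of this behaviour, using one function of each kind plus two non-function objects:

    ```r
    # All three function types collapse to "function"
    c(storage.mode(mean), storage.mode(sum), storage.mode(`if`))
    #> [1] "function" "function" "function"

    # Everything else is passed through unchanged from typeof()
    storage.mode(1L)
    #> [1] "integer"
    storage.mode(quote(x))
    #> [1] "symbol"
    ```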

    Mode

    The call mode(x) also calls typeof(x) and simplifies closures / specials / builtins into the single mode "function". In addition, it returns "numeric" for both integer and real number types. It changes "symbol" to "name", and changes "language" to either "call" or "(", depending on whether the language object starts with a parenthesis.
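
    The cases where mode departs from typeof can be checked side by side. A short sketch with arbitrary example objects:

    ```r
    # Integers and doubles share the mode "numeric"
    c(mode(1L), mode(2.5))
    #> [1] "numeric" "numeric"

    # Symbols are reported as "name"
    mode(quote(x))
    #> [1] "name"

    # Language objects become "call", or "(" when they start with a parenthesis
    mode(quote(f(x)))
    #> [1] "call"
    mode(quote((x)))
    #> [1] "("

    # Functions of any type have mode "function"
    mode(sum)
    #> [1] "function"
    ```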

    This diagram gives the full mapping between type, storage mode and mode: [diagram not shown]

    Class

    You can use R every day without needing to know anything about type, mode and storage mode, but a competent R user needs to know the concept of class. It is an object's class that determines which methods are dispatched on calls to generic functions, and therefore it is class that controls an object's behaviour.
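
    As a minimal sketch of class-based dispatch, here is a hypothetical S3 generic called describe with a default method and a method for a made-up class "foo". The two objects have the same type, yet are handled by different methods:

    ```r
    # A generic and two methods (all names are illustrative)
    describe <- function(x) UseMethod("describe")
    describe.default <- function(x) "no special method"
    describe.foo <- function(x) "method for class foo"

    plain <- 1
    classed <- structure(1, class = "foo")

    # Same type ...
    c(typeof(plain), typeof(classed))
    #> [1] "double" "double"

    # ... different behaviour, chosen by class
    describe(plain)
    #> [1] "no special method"
    describe(classed)
    #> [1] "method for class foo"
    ```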

    You can set an object's class simply by setting its class attribute:

    x <- 1
    class(x) <- "foo"
    class(x)
    #> [1] "foo"
    

    However, every object in R has a class, even objects with no class attribute:

    x <- 1:5
    class(x)
    #> [1] "integer"
    

    This is because R determines an object's class via the C function do_data_class. If there is a "class" attribute set, then that is the class. If there is no "class" attribute set, then first R will check for a dimension attribute. If there is a non-zero dimension attribute, then the class will be an "array" (though if it has exactly two dimensions it will have the class c("matrix", "array")). If there is no class or dimension attribute then the type of the object is retrieved. Depending on the type, R will return:

    • "function" for closures, builtin or specials
    • "numeric" for REALSXP (though surprisingly not for INTSXP)
    • "name" for symbols
    • In the case of language objects, if the first symbol is if, while, for, =, <-, ( or {, then this will be returned. Otherwise "call" is returned. This is quite an arcane system, which seems to be a way of handling different elements of the language's grammar.
    • In all other cases, the class will be the typeof the object.
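
    These rules can be checked from the console. The sketch below assumes R 4.0.0 or later, where matrices report both "matrix" and "array":

    ```r
    # Dimension attribute present, no class attribute
    class(matrix(1:4, nrow = 2))
    #> [1] "matrix" "array"
    class(array(1:8, dim = c(2, 2, 2)))
    #> [1] "array"

    # Functions, doubles and integers
    class(sum)
    #> [1] "function"
    class(1)
    #> [1] "numeric"
    class(1L)
    #> [1] "integer"

    # Symbols and language objects
    class(quote(x))
    #> [1] "name"
    class(quote(if (a) b))
    #> [1] "if"
    class(quote(f(x)))
    #> [1] "call"
    ```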

    In summary, type is the actual type of the objects stored in memory, and mode is a partial abstraction of the actual type that gives us the familiar names we often think of as "basic types" in R. Storage mode is a close synonym for type with limited usefulness. Class is the most familiar and useful abstraction of data types in R, and if an object doesn't have a specified class, R will assign it an implicit class based on the above rules.