Tags: r, string, dplyr, subset

Unusual Behaviour of colon operator : in R


2000:2017

The expected output is a vector of the sequence 2000 to 2017 with a step of 1.

Output: 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017

'2000':'2017'

However, when I type this command, it still gives me the same output.

Output: 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017

I am unable to understand how it generates a sequence from character strings.

Edit 1:

Ultimately, I am trying to understand why the code below works. How can X2007:X2011 possibly work? The select function is from the dplyr package.

[Image: R code]

My data has column names similar to those in the image above, but without the 'X' prefix. I just have years such as 2007, 2008, etc.

For me select(Division, State, 2007:2011) does not work.

Error: Can't subset columns that don't exist. x Locations 2007, 2008, 2009, 2010, and 2011 don't exist.

But this works: select(Division, State, '2007':'2011').


Solution

  • If we check the more generic seq.default, it converts from and to from character to numeric:

    ...
    if (!missing(from) && !is.finite(if (is.character(from)) from <- as.numeric(from) else from)) 
        stop("'from' must be a finite number")
    if (!missing(to) && !is.finite(if (is.character(to)) to <- as.numeric(to) else to)) 
    ...
    

    Along the same lines, the documentation for the colon operator (?`:`) says:

    For other arguments from:to is equivalent to seq(from, to), and generates a sequence from from to to in steps of 1 or -1. Value to will be included if it differs from from by an integer up to a numeric fuzz of about 1e-7. Non-numeric arguments are coerced internally (hence without dispatching methods) to numeric—complex values will have their imaginary parts discarded with a warning.
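
A quick way to check the coercion described above is to compare the two forms directly. This is only a minimal sketch; the variable names from_chr, from_num and res are made up for illustration.

```r
# Character endpoints are coerced to numeric before the sequence is built
from_chr <- "2000":"2017"
from_num <- 2000:2017

# Both forms produce the same 18 values
all(from_chr == from_num)
length(from_chr)

# Strings that cannot be parsed as numbers become NA, so `:` raises an error
res <- tryCatch(suppressWarnings("a":"b"), error = function(e) "error")
res
```

The last call shows the limit of the coercion: it only helps when the strings actually look like numbers.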


    Regarding the updated question about subset and select: a column name that starts with a digit is a non-syntactic name, so it has to be wrapped in backticks to be evaluated as a column name

    df1 <- data.frame(`2007` = 1:5, `2008` = 6:10, 
          `2012` =  11:15, v1 = rnorm(5), check.names = FALSE)
    subset(df1, select = `2007`:`2012`)
    #  2007 2008 2012
    #1    1    6   11
    #2    2    7   12
    #3    3    8   13
    #4    4    9   14
    #5    5   10   15
    

    Or with dplyr::select

    library(dplyr)
    select(df1, `2007`:`2012`)
    #  2007 2008 2012
    #1    1    6   11
    #2    2    7   12
    #3    3    8   13
    #4    4    9   14
    #5    5   10   15
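
The '2007':'2011' form from the question appears to work for a different reason than the base R coercion. Inside dplyr::select, a bare number is read as a column position, while a string is read as a column name, so a range between two strings is a range between two named columns. A minimal sketch, assuming a dplyr version whose selection helpers accept strings in ranges (as the question reports), with a made-up data frame df2:

```r
library(dplyr)

# Hypothetical data with year-like column names kept as given
df2 <- data.frame(State = c("A", "B"), `2007` = 1:2, `2008` = 3:4,
                  `2011` = 5:6, check.names = FALSE)

# Strings are treated as column names, so this is a range of columns
by_name <- select(df2, State, "2007":"2011")
names(by_name)

# Bare numbers are treated as column positions; df2 has only 4 columns,
# so positions 2007 to 2011 do not exist and select() raises an error
by_pos <- tryCatch(select(df2, State, 2007:2011), error = function(e) "error")
by_pos
```

Backticks and strings select the same columns here; backticks are the more conventional way to refer to a non-syntactic name.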
    

    If the names have an X at the beginning (this happens when the data are read without check.names = FALSE, since the default is TRUE, or when the dataset is created with data.frame, where check.names also defaults to TRUE), the unquoted names work directly

    df1 <- data.frame(`2007` = 1:5, `2008` = 6:10, `2012` =  11:15, v1 = rnorm(5))
    subset(df1, select = X2007:X2012)
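
The X prefix comes from make.names, which data.frame applies to column names when check.names = TRUE. A small sketch to see the renaming in isolation (the object names with_check and no_check are illustrative only):

```r
# make.names() turns a name starting with a digit into a syntactic one
make.names("2007")
# "X2007"

# Default check.names = TRUE: the name is adjusted
with_check <- data.frame(`2007` = 1:3)
# check.names = FALSE: the name is kept as given
no_check <- data.frame(`2007` = 1:3, check.names = FALSE)

names(with_check)
names(no_check)
```

So whether X2007:X2012 or `2007`:`2012` is the right spelling depends on how the data frame was created or read.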