Search code examples
rdplyrfilterpipedata-mining

What is the use of "-" before the date variable column?


Question

Why the below two codes return different results, or what is the use of "-" before the "date" variable column in this code?

bikes <- read_csv("https://raw.githubusercontent.com/whipson/Ottawa_Bicycles/master/bikes_app.csv", col_types = c("?nnnnnnnnnnnnnn"))
coords <- read_csv("https://raw.githubusercontent.com/whipson/Ottawa_Bicycles/master/coords.csv")

the result of the below code with or without "-" before the date variable is different, So I would like to know the use of "-" in the code before the "date"?

bikes_plot <- bikes %>%
  pivot_longer(names_to = "counter", values_to = "count", -date) %>%
  left_join(coords, by = "counter")

bikes_plot <- bikes %>%
  pivot_longer(names_to = "counter", values_to = "count", date) %>%
  left_join(coords, by = "counter")

I have daunt why the two results of the code in R console are different?

in pivot_longer(names_to = "counter", values_to = "count", -date), I also have a daunt if "values_to = "count", count is an existing column name here in the csv file, or it is a summary.


Solution

  • In R, when you name your arguments, they are always assigned to the respective name within the formal definition. When arguments are not named, they are filled from left to right in the arg list (before ...) from those arguments that have not had a named-argument assigned.

    If you look at the arguments of pivot_longer:

         pivot_longer(
           data,
           cols,
           names_to = "name",
           names_prefix = NULL,
           names_sep = NULL,
           names_pattern = NULL,
           names_ptypes = NULL,
           names_transform = NULL,
           names_repair = "check_unique",
           values_to = "value",
           values_drop_na = FALSE,
           values_ptypes = NULL,
           values_transform = NULL,
           ...
         )
    

    and take your call:

    bikes %>%
      pivot_longer(names_to = "counter", values_to = "count", -date)
    

    The "current data" in the %>%-pipe is assigned to the first argument, which is data. The next unnamed argument -date (or date in the other block) is assigned to cols, which are

        cols: <'tidy-select'> Columns to pivot into longer format.
    

    The presence of the leading - means "all columns except date", in the tidy-select way of selecting columns. Other options could be everything(), ends_with("bar"), etc.

    When you say -date, you are saying that you want the date column to remain unchanged: it remains as a column, and in this case you will get all remaining columns pivoted/reshaped into just two columns: one column "count" which holds the other values, and "counter" that includes the column name from which the count was retrieved. Because we're selecting all columns except one for pivoting (and we aren't doing any other fancy pivot-by-pattern), then we should see (ncol(bikes)-1) * nrow(bikes) rows resulting.

    bikes %>%
      pivot_longer(names_to = "counter", values_to = "count", -date)
    # # A tibble: 49,840 × 3
    #    date                counter          count
    #    <dttm>              <chr>            <dbl>
    #  1 2010-01-01 00:00:00 alexandra_bridge     0
    #  2 2010-01-01 00:00:00 eastern_canal        0
    #  3 2010-01-01 00:00:00 ottawa_river         0
    #  4 2010-01-01 00:00:00 western_canal       NA
    #  5 2010-01-01 00:00:00 laurier_bay         NA
    #  6 2010-01-01 00:00:00 laurier_lyon        NA
    #  7 2010-01-01 00:00:00 laurier_metcalfe    NA
    #  8 2010-01-01 00:00:00 somerset_bridge     NA
    #  9 2010-01-01 00:00:00 otrain_young        NA
    # 10 2010-01-01 00:00:00 otrain_gladstone    NA
    # # … with 49,830 more rows
    # # ℹ Use `print(n = ...)` to see more rows
    

    When you instead use date (not -date), you are saying to pivot only that column ... which seems kind of silly, since all it does it remove the date column, add one column named "counter" that contains invariably "date" (the name of the column pivoted), and one column named "count" that is effectively (in this case) exactly the same as the original date column. The remainder of the columns are not touched, and since the number of columns we pivot is only 1, the number of rows in the frame is unchanged.