Why the below two codes return different results, or what is the use of "-" before the "date" variable column in this code?
bikes <- read_csv("https://raw.githubusercontent.com/whipson/Ottawa_Bicycles/master/bikes_app.csv", col_types = c("?nnnnnnnnnnnnnn"))
coords <- read_csv("https://raw.githubusercontent.com/whipson/Ottawa_Bicycles/master/coords.csv")
the result of the below code with or without "-" before the date variable is different, So I would like to know the use of "-" in the code before the "date"?
bikes_plot <- bikes %>%
pivot_longer(names_to = "counter", values_to = "count", -date) %>%
left_join(coords, by = "counter")
bikes_plot <- bikes %>%
pivot_longer(names_to = "counter", values_to = "count", date) %>%
left_join(coords, by = "counter")
I have daunt why the two results of the code in R console are different?
in pivot_longer(names_to = "counter", values_to = "count", -date)
,
I also have a daunt if "values_to = "count"
, count is an existing column name here in the csv file, or it is a summary.
In R, when you name your arguments, they are always assigned to the respective name within the formal definition. When arguments are not named, they are filled from left to right in the arg list (before ...
) from those arguments that have not had a named-argument assigned.
If you look at the arguments of pivot_longer
:
pivot_longer(
data,
cols,
names_to = "name",
names_prefix = NULL,
names_sep = NULL,
names_pattern = NULL,
names_ptypes = NULL,
names_transform = NULL,
names_repair = "check_unique",
values_to = "value",
values_drop_na = FALSE,
values_ptypes = NULL,
values_transform = NULL,
...
)
and take your call:
bikes %>%
pivot_longer(names_to = "counter", values_to = "count", -date)
The "current data" in the %>%
-pipe is assigned to the first argument, which is data
. The next unnamed argument -date
(or date
in the other block) is assigned to cols
, which are
cols: <'tidy-select'> Columns to pivot into longer format.
The presence of the leading -
means "all columns except date
", in the tidy-select way of selecting columns. Other options could be everything()
, ends_with("bar")
, etc.
When you say -date
, you are saying that you want the date
column to remain unchanged: it remains as a column, and in this case you will get all remaining columns pivoted/reshaped into just two columns: one column "count"
which holds the other values, and "counter"
that includes the column name from which the count
was retrieved. Because we're selecting all columns except one for pivoting (and we aren't doing any other fancy pivot-by-pattern), then we should see (ncol(bikes)-1) * nrow(bikes)
rows resulting.
bikes %>%
pivot_longer(names_to = "counter", values_to = "count", -date)
# # A tibble: 49,840 × 3
# date counter count
# <dttm> <chr> <dbl>
# 1 2010-01-01 00:00:00 alexandra_bridge 0
# 2 2010-01-01 00:00:00 eastern_canal 0
# 3 2010-01-01 00:00:00 ottawa_river 0
# 4 2010-01-01 00:00:00 western_canal NA
# 5 2010-01-01 00:00:00 laurier_bay NA
# 6 2010-01-01 00:00:00 laurier_lyon NA
# 7 2010-01-01 00:00:00 laurier_metcalfe NA
# 8 2010-01-01 00:00:00 somerset_bridge NA
# 9 2010-01-01 00:00:00 otrain_young NA
# 10 2010-01-01 00:00:00 otrain_gladstone NA
# # … with 49,830 more rows
# # ℹ Use `print(n = ...)` to see more rows
When you instead use date
(not -date
), you are saying to pivot only that column ... which seems kind of silly, since all it does it remove the date
column, add one column named "counter"
that contains invariably "date"
(the name of the column pivoted), and one column named "count"
that is effectively (in this case) exactly the same as the original date
column. The remainder of the columns are not touched, and since the number of columns we pivot is only 1, the number of rows in the frame is unchanged.