First time asking a question here and very new to R, so please bear with me if I format/phrase something incorrectly. Thanks in advance for your help!
I have multiple columns stored as factors in my dataframe. However, after running a code chunk of dplyr functions with pipes, the tibble output is not retaining the factor levels from the dataframe. The data is out of order.
Here are the factor levels (I eliminated some of the preceding code because it's long):
...
POST_600_WOBA_GRADE = factor(POST_600_WOBA_GRADE, levels = c("Elite", "Great", "Good", "Average", "Below Average", "Poor", "Awful", "DNQ")))
When I run this to confirm, the correct levels show:
levels(all_batter_career_data_by_ID$POST_600_WOBA_GRADE)
[1] "Elite" "Great" "Good" "Average" "Below Average"
[6] "Poor" "Awful" "DNQ"
However, when I run this below, DNQ and Awful get flipped. DNQ (Did Not Qualify) should be on the far right.
all_batter_career_data_by_ID %>%
filter(BAT_DEBUT < "2015-01-01") %>%
group_by(FIRST_600_WOBA_GRADE, POST_600_WOBA_GRADE) %>%
summarize(n = n()) %>%
pivot_wider(names_from = POST_600_WOBA_GRADE, values_from = n)
# A tibble: 8 × 9
# Groups: FIRST_600_WOBA_GRADE [8]
FIRST_600_WOBA_GRADE Elite Great Good Average `Below Average` Poor DNQ Awful
<fct> <int> <int> <int> <int> <int> <int> <int> <int>
1 Elite 1 1 4 1 NA NA NA NA
2 Great 5 9 14 7 3 1 8 NA
3 Good 1 21 62 56 20 12 44 1
4 Average 1 10 45 77 29 29 78 6
5 Below Average NA 5 24 36 14 26 56 7
6 Poor NA 5 22 35 14 33 94 20
7 Awful NA 1 7 21 9 38 132 22
8 DNQ NA NA NA NA NA NA 1126 NA
I am running the exact same code for another variable with the same factor levels, and the tibble is in the correct order, so I'm not sure why this one is flipped. Thanks again for your help!
To understand what is going on, here's some sample data:
set.seed(0)
df <- data.frame(
"first_grade" = "Elite",
"second_grade" = c("Elite", "Great", "Good", "Average", "Below Average", "Poor", "Awful", "DNQ") |> factor(levels = c("Elite", "Great", "Good", "Average", "Below Average", "Poor", "Awful", "DNQ")),
"n" = sample(1:100, 8, replace = TRUE))
first_grade second_grade n
1 Elite Elite 14
2 Elite Great 68
3 Elite Good 39
4 Elite Average 1
5 Elite Below Average 34
6 Elite Poor 87
7 Elite Awful 43
8 Elite DNQ 14
Pivoting this data wider gives us the right answer, but once we randomly sort the rows, we get the wrong answer:
df <- df[sample(1:8, 8),]
tidyr::pivot_wider(df, names_from = "second_grade", values_from = "n")
# A tibble: 1 × 9
first_grade Great Good Awful Elite `Below Average` Average DNQ Poor
<chr> <int> <int> <int> <int> <int> <int> <int> <int>
1 Elite 68 39 43 14 34 1 14 87
As Darren has pointed out, the solution is to use the names_sort = TRUE
argument for pivot_wider()
(sorting the column you're wanting to pivot wider would work too):
tidyr::pivot_wider(df, names_from = "second_grade", values_from = "n", names_sort = TRUE)
Output:
# A tibble: 1 × 9
first_grade Elite Great Good Average `Below Average` Poor Awful DNQ
<chr> <int> <int> <int> <int> <int> <int> <int> <int>
1 Elite 14 68 39 1 34 87 43 14