There is a lovely chunk of code in TidyText Mining Section 3.3 that I am trying to replicate in my own dataset. However, in my data, I cannot get ggplot to 'remember' that I want the data in descending order, and that I want a certain top_n.
I can run the code from TidyText Mining and I get the same charts that the book shows. However, when I run this on my own dataset, the facet wraps do not show the top_n (they seem to show a random number of categories) and data within each facet is not sorted by descending order.
I can replicate this problem with some random text data and the full code, but I can also replicate it with mtcars, which really confuses me.
I expect the following chart to show me mpg in descending order for each facet, and for each facet to only give me the top 1 category. It does neither for me.
require(tidyverse)
mtcars %>%
  arrange(desc(mpg)) %>%
  mutate(gear = factor(gear, levels = rev(unique(gear)))) %>%
  group_by(am) %>%
  top_n(1) %>%
  ungroup %>%
  ggplot(aes(gear, mpg, fill = am)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "mpg") +
  facet_wrap(~am, ncol = 2, scales = "free") +
  coord_flip()
But what I really want is to have a chart like this sorted as in the TidyText book (data for example only).
require(tidyverse)
require(tidytext)
starwars <- tibble (film = c("ANH", "ESB", "ROJ"),
text = c("It is a period of civil war. Rebel spaceships, striking from a hidden base, have won their first victory against the evil Galactic Empire. During the battle, Rebel spies managed to steal secret plans to the Empire's ultimate weapon, the DEATH STAR, an armored space station with enough power to destroy an entire planet. Pursued by the Empire's sinister agents, Princess Leia races home aboard her starship, custodian of the stolen plans that can save her people and restore freedom to the galaxy.....",
"It is a dark time for the Rebellion. Although the Death Star has been destroyed, Imperial troops have driven the Rebel forces from their hidden base and pursued them across the galaxy. Evading the dreaded Imperial Starfleet, a group of freedom fighters led by Luke Skywalker has established a new secret base on the remote ice world of Hoth. The evil lord Darth Vader, obsessed with finding young Skywalker, has dispatched thousands of remote probes into the far reaches of space....",
"Luke Skywalker has returned to his home planet of Tatooine in an attempt to rescue his friend Han Solo from the clutches of the vile gangster Jabba the Hutt. Little does Luke know that the GALACTIC EMPIRE has secretly begun construction on a new armored space station even more powerful than the first dreaded Death Star. When completed, this ultimate weapon will spell certain doom for the small band of rebels struggling to restore freedom to the galaxy...")) %>%
  unnest_tokens(word, text) %>%
  mutate(film = as.factor(film)) %>%
  count(film, word, sort = TRUE) %>%
  ungroup()
total_wars <- starwars %>%
  group_by(film) %>%
  summarize(total = sum(n))
starwars <- left_join(starwars, total_wars)
starwars <- starwars %>%
  bind_tf_idf(word, film, n)
starwars %>%
  arrange(desc(tf_idf)) %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>%
  group_by(film) %>%
  top_n(10) %>%
  ungroup %>%
  ggplot(aes(word, tf_idf, fill = film)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~film, ncol = 2, scales = "free") +
  coord_flip()
I believe what is tripping you up here is that top_n() defaults to the last variable in the table unless you tell it what variable to use for ordering. In the examples in our book, the last variable in the data frame is tf_idf, so that is what is used for ordering. In the mtcars example, top_n() is using the last column in the data frame for ordering; that happens to be carb.
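To see that default in action, here is a small sketch on mtcars (the object names by_carb and by_mpg are made up for illustration). Without a second argument, top_n() falls back to the last column, carb, and because it keeps ties it can return more rows than you asked for. Naming mpg explicitly gives the one-row-per-group result the question expected.

```r
library(dplyr)

# No ordering variable given: top_n() uses the last column (carb)
# and prints a message saying so. Ties on carb are all kept.
by_carb <- mtcars %>%
  group_by(am) %>%
  top_n(1) %>%
  ungroup()

# Ordering variable given explicitly: the highest-mpg car per am group.
by_mpg <- mtcars %>%
  group_by(am) %>%
  top_n(1, mpg) %>%
  ungroup()

nrow(by_carb)  # more than one row per group, because several cars tie on carb
nrow(by_mpg)   # one row for each value of am
```

The extra rows in by_carb are probably what looks like a "random number of categories" in the faceted chart.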
You can always tell top_n() what variable you want to use for ordering by passing it as an argument. For example, check out this similar workflow using the diamonds dataset.
library(tidyverse)
diamonds %>%
  arrange(desc(price)) %>%
  group_by(clarity) %>%
  # pass price explicitly so top_n() does not fall back to the last column
  top_n(10, price) %>%
  ungroup() %>%
  ggplot(aes(cut, price, fill = clarity)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~clarity, scales = "free") +
  scale_x_discrete(drop = FALSE) +
  coord_flip()
Created on 2018-05-17 by the reprex package (v0.2.0).
These example datasets are not perfect parallels because they don't have one row per combination of characteristics in the way that the tidy text data frames do. I am pretty sure the issue with top_n() is the problem, though.
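Applied to the mtcars chart from the question, the change might look like the sketch below. It is only an illustration: top_cars and p are hypothetical names, and converting gear and am to factors is my own assumption so that the axis and fill are treated as discrete.

```r
library(dplyr)
library(ggplot2)

# Keep the single highest-mpg car within each am group,
# naming mpg explicitly instead of relying on the last column.
top_cars <- mtcars %>%
  group_by(am) %>%
  top_n(1, mpg) %>%
  ungroup() %>%
  mutate(gear = factor(gear), am = factor(am))  # assumption: discrete axis and fill

p <- ggplot(top_cars, aes(gear, mpg, fill = am)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "mpg") +
  facet_wrap(~am, ncol = 2, scales = "free") +
  coord_flip()

p
```

Each facet should now contain a single bar, because only one car per am group survives the top_n(1, mpg) step.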