r ggplot2 tidytext

Gather function in R dropping column


I'm comparing the language used by several authors, using texts downloaded from Project Gutenberg, but I'm having trouble with my tibble manipulation. My end goal is a plot comparing the word frequencies of Herman Melville and Lewis Carroll against those of Washington Irving. However, my tibble doesn't have an Irving column, which is a problem when I then try to refer to it in my ggplot call.

I'm expecting my frequency tibble to look like

# A tibble: 72,984 x 4
   word           Irving       author     proportion
   <chr>          <dbl>        <chr>      <dbl>
 1 a'dale         0.00000907   Melville   NA        
 2 aa             NA           Melville   0.0000246
 3 ab             NA           Melville   NA        
 4 aback          NA           Melville   0.0000369
 5 abana          NA           Melville   0.0000123
 6 abandon        0.0000363    Melville   0.0000861
 7 abandoned      0.000163     Melville   0.000172 
 8 abandoning     0.0000181    Melville   NA        
 9 abandonment    0.00000907   Melville   0.0000123
10 abasement      0.0000181    Melville   0.0000123
# ... with 72,974 more rows

but instead it looks like

# A tibble: 72,984 x 3
   word        author   proportion
   <chr>       <chr>         <dbl>
 1 a'dale      Melville NA        
 2 aa          Melville  0.0000246
 3 ab          Melville NA        
 4 aback       Melville  0.0000369
 5 abana       Melville  0.0000123
 6 abandon     Melville  0.0000861
 7 abandoned   Melville  0.000172 
 8 abandoning  Melville NA        
 9 abandonment Melville  0.0000123
10 abasement   Melville  0.0000123
# ... with 72,974 more rows

and I'm not sure what I'm doing wrong when I gather to make the frequency tibble.

Code

# Import libraries
library(tidyverse) # dplyr, tidyr, stringr, ggplot2
library(tidytext)
library(gutenbergr)
library(scales) # percent_format()

# Download four works from each author
wirving <- gutenberg_download(c(49872, 41, 14228, 13514)) 
hmelville <- gutenberg_download(c(15, 4045, 28656, 2694))
lcarroll <- gutenberg_download(c(19033, 620, 12, 4763))

# tidy each author
tidy_wirving <- wirving %>%
  unnest_tokens(word, text) %>%
  mutate(word = str_extract(word, "[a-z']+")) %>%
  anti_join(stop_words, by = "word")

tidy_hmelville <- hmelville %>%
  unnest_tokens(word, text) %>%
  mutate(word = str_extract(word, "[a-z']+")) %>%
  anti_join(stop_words, by = "word")

tidy_lcarroll <- lcarroll %>%
  unnest_tokens(word, text) %>%
  mutate(word = str_extract(word, "[a-z']+")) %>%
  anti_join(stop_words, by = "word")

# calculate word frequency
frequency_by_word_across_authors <- 
  bind_rows(mutate(tidy_wirving, author = "Irving"),
            mutate(tidy_hmelville, author = "Melville"),
            mutate(tidy_lcarroll, author = "Carroll")) %>%
  mutate(word = str_extract(word, "[a-z']+")) %>%
  count(author, word) %>%
  group_by(author) %>%
  mutate(proportion = n /sum(n)) %>%
  select(-n) %>%
  spread(author, proportion)

# compare frequency of Melville and Carroll against Irving
frequency <- frequency_by_word_across_authors %>%
  gather(author, proportion,`Melville`:`Carroll`)

ggplot(frequency,
       aes(x = proportion,
           y =`Irving`,
           color = abs(`Irving`- proportion))) +
  geom_abline(color = "gray40", 
              lty = 2) +
  geom_jitter(alpha = 0.1, 
              size = 2.5,
              width = 0.3, 
              height = 0.3) +
  geom_text(aes(label = word),
            check_overlap = TRUE, 
            vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  scale_color_gradient(limits = c(0, 0.001),
                       low = "darkslategray4",
                       high = "gray75") +
  facet_wrap(~author, ncol = 2) +
  theme(legend.position="none") +
  labs(y = "Washington Irving", x = NULL)

# Error in FUN(X[[i]], ...) : object 'Irving' not found

Solution

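  • The issue is how you are using gather(). After spread(), the author columns are in alphabetical order (word, Carroll, Irving, Melville), so the range Melville:Carroll also covers Irving, and Irving is gathered into author/proportion along with the other two instead of staying a column of its own. The two columns that you want to gather are not next to each other, so don't select them with a : range; name them individually: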

    frequency <- frequency_by_word_across_authors %>%
      gather(author, proportion, Carroll, Melville)
    
    
    ggplot(frequency,
           aes(x = proportion,
               y = Irving,
               color = abs(Irving - proportion))) +
      geom_abline(color = "gray40", 
                  lty = 2) +
      geom_jitter(alpha = 0.1, 
                  size = 2.5,
                  width = 0.3, 
                  height = 0.3) +
      geom_text(aes(label = word),
                check_overlap = TRUE, 
                vjust = 1.5) +
      scale_x_log10(labels = percent_format()) +
      scale_y_log10(labels = percent_format()) +
      scale_color_gradient(limits = c(0, 0.001),
                           low = "darkslategray4",
                           high = "gray75") +
      facet_wrap(~author, ncol = 2) +
      theme(legend.position="none") +
      labs(y = "Washington Irving", x = NULL)
    

    Created on 2019-11-01 by the reprex package (v0.3.0)
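
To see why a range selection pulls in the wrong column, here is a minimal sketch on made-up data. The toy tibble and its numbers are illustrative assumptions, not the Gutenberg counts. The last call uses pivot_longer(), which assumes tidyr 1.0.0 or later and is offered only as a possible alternative to gather(), not as part of the original answer.

```r
library(tibble)
library(tidyr)

# Toy proportions (made-up numbers, for illustration only)
toy <- tribble(
  ~author,    ~word,     ~proportion,
  "Irving",   "abandon", 0.2,
  "Melville", "abandon", 0.4,
  "Carroll",  "abandon", 0.1,
  "Irving",   "aback",   0.3,
  "Melville", "aback",   0.5
)

# spread() creates one column per author, in sorted order
wide <- spread(toy, author, proportion)
names(wide)
# "word" "Carroll" "Irving" "Melville"

# A range from Melville to Carroll spans Irving as well,
# so Irving ends up as a value in the author column
bad <- gather(wide, author, proportion, Melville:Carroll)
names(bad)
# "word" "author" "proportion"

# Naming the two columns individually leaves Irving in place
good <- gather(wide, author, proportion, Carroll, Melville)
names(good)
# "word" "Irving" "author" "proportion"

# Possible alternative, assuming tidyr >= 1.0.0
longer <- pivot_longer(wide,
                       cols = c(Carroll, Melville),
                       names_to = "author",
                       values_to = "proportion")
names(longer)
# "word" "Irving" "author" "proportion"
```

Calling names() on the wide tibble before gathering is a quick way to check which columns a range would actually cover.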