Search code examples
rdataframeggplot2subset

How can I subset the top 10 by frequency of a 63,000+ row dataset in R


I have a dataset of analysis in the form:

| Compound | Concentration |SampleID|

There are 200 sample IDs, with 9000 unique Compounds giving a df with 63,000 rows. (not every compound is found in every sample)

What I would like to do is take the ten most frequently occurring compounds and create a subset so I can graph their concentrations using a boxplot or similar

I've tried using the below, but this leads to an error, and also only filters the top ten (so they're all the same compound)

df %>% 
  arrange(desc(df$Concentration)) %>%
  slice(1:10, preserve=T) %>%
  ggplot(., aes(x=df$Compound,y=df$Concentration))+
  geom_point()+
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x="Compound", y="Frequency")

My other thought would be

  arrange(desc(as.data.frame(table(df$Compound)))) %>%
  slice(1:50, preserve=T) %>%
  ggplot(., aes(x=df$Compound,y=df$Concentration))+
  geom_point()+
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x="Compound", y="Frequency")

Neither work. I feel like I need to make a df with a list of the top 10, then filter my df to give dftop and then subset my df to just those elements

Can anyone help simplify this?


Solution

  • Create a vector of the top 10 compounds, then filter the dataframe based on this vector. Illustration using most frequent homeworlds in dplyr::starwars:

    library(dplyr)
    
    top10 <- starwars %>% 
      count(homeworld, sort = TRUE) %>%
      head(10) %>%
      pull(homeworld)
    
    starwars %>%
      filter(homeworld %in% top10)
    
    # A tibble: 48 × 14
       name        height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
       <chr>        <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
     1 Luke Skywa…    172    77 blond   fair    blue       19   male  mascu… Tatooi…
     2 C-3PO          167    75 <NA>    gold    yellow    112   none  mascu… Tatooi…
     3 R2-D2           96    32 <NA>    white,… red        33   none  mascu… Naboo  
     4 Darth Vader    202   136 none    white   yellow     41.9 male  mascu… Tatooi…
     5 Leia Organa    150    49 brown   light   brown      19   fema… femin… Aldera…
     6 Owen Lars      178   120 brown,… light   blue       52   male  mascu… Tatooi…
     7 Beru White…    165    75 brown   light   blue       47   fema… femin… Tatooi…
     8 R5-D4           97    32 <NA>    white,… red        NA   none  mascu… Tatooi…
     9 Biggs Dark…    183    84 black   light   brown      24   male  mascu… Tatooi…
    10 Anakin Sky…    188    84 blond   fair    blue       41.9 male  mascu… Tatooi…
    # … with 38 more rows, 4 more variables: species <chr>, films <list>,
    #   vehicles <list>, starships <list>, and abbreviated variable names
    #   ¹​hair_color, ²​skin_color, ³​eye_color, ⁴​birth_year, ⁵​homeworld