Search code examples
rwindow-functions

How to use window functions?


I'm struggling to get window functions working in R to rank groups by the number of rows.

Here's my sample code:

data <- read_csv("https://dq-content.s3.amazonaws.com/498/book_reviews.csv")

data %>% 
    group_by(state) %>%
    mutate(num_books = n(),
           state_rank = dense_rank(num_books)) %>% 
    arrange(num_books)

The expected output is that the original data will have a new column that tells me the rank for each row (book, state, price and review) depending on whether that row is for a state with the most book reviews (would have state_rank of 1); second most books (rank 2), etc.

Manually I can get the output like this:

data <- read_csv("https://dq-content.s3.amazonaws.com/498/book_reviews.csv")

manual_ranks <- data %>% 
  count(state) %>% 
  mutate(state_rank = rank(1/n))

desired_output <- data %>% 
  left_join(manual_ranks) %>% 
  arrange(state_rank)

sample_n(desired_output, 10)

In other words, I want the last column of this table:

 data %>% 
 count(state) %>% 
 mutate(state_rank = rank(1/n)) 

added to each row of the original table (without having to create this table and then using left_join by state; that's the point of window functions).

Anyway, with the original code, you'll see that all state_rank just say 1, when I would expect states with the most book reviews to be ranked 1, second most reviews would have 2, etc.

My goal is to then be able to filter by, say, state_rank > 4. That is, I want to keep all the rows in the original data for top 4 states with the most book reviews.


Solution

  • I don't quite get why you would like to rank on number of reviews per state without also grouping by title and perhaps giving some weight to price, but by your original description:

    library(tidyverse)
    data <- read_csv("https://dq-content.s3.amazonaws.com/498/book_reviews.csv")
    
    data_ranked <- data %>%
        group_by(state) %>% 
        mutate(reviews = sum(!is.na(review) == TRUE)) %>% 
        ungroup() %>% 
        mutate(state_rank = dense_rank(1/reviews), .keep = "unused")
    

    This will let you filter your data by states with the most or fewest total reviews. Notice that I drop the column reviews, but if you want to keep it, just remove the .keep = "unused".

    edit: I originally forgot to reverse the rank so that the state with the most reviews have the highest rank (as the rank function ranks the numerical values from low to high). Making it a fraction of 1 is the simplest way I could think of.

    Also, counting the number of reviews per group exploits the fact that a logical TRUE equals 1 in R. I assume you only want to count the reviews that does NOT have a NA value. I would definitely want to weigh the reviews by converting them to a scale - so that a book with only poor reviews is not a good thing on a state ranking.