r  dplyr  data-science  data-analysis

Ranking of data that have the most data points in another column


I would like to find the 10 products that have data points on the most dates. Since the quantity sold in a day is recorded under "soldUnits", there are no duplicate entries for an ArticleNr on any one date, so the maximum for a single article in the example dataset would be "365 obs. of 3 variables". How can I filter my dataset accordingly?

Edit: With the edited dataset given, I want to single out ArticleNr "11" because it has the most corresponding entries in column "Date".

The problem with my real data is that there are around 2000 products, and I can't see which ArticleNr has the most corresponding entries in column "Date".

Edit 2: As an MRE, we can look at this dataset:

df <- data.frame(ArticleNr = c("11", "22", "33", "11", "22", "11"),
                 soldDate = as.Date(c("2020-01-01", "2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02", "2020-01-03"), "%Y-%m-%d"),
                 soldUnits = c(1, 1, 1, 1, 1, 1),
                 stringsAsFactors = FALSE)

That leads to

   ArticleNr soldDate      soldUnits
     11      2020-01-01         1   
     22      2020-01-01         1   
     33      2020-01-01         1   
     11      2020-01-02         1   
     22      2020-01-02         1   
     11      2020-01-03         1

My desired result would be a ranking with n ranks (top 3, top 10, top 25).

For this data frame it would look like this:

   Rank  ArticleNr  soldOnDates     
     1     11         3         #<-- ArticleNr 11 was sold on 3 out of 3 days, so it has Rank 1 
     2     22         2   
     3     33         1   

How can I achieve this on a big dataset with around 2000 products?


Solution

  • This is equivalent to @Pa_Syl's answer, which uses table(). As an added bonus, you can choose your own column names instead of Var1 and Freq. The steps after summarise() sort the counts and assign a rank to each ArticleNr.

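    library(dplyr)

    df %>% group_by(ArticleNr) %>% summarise(SoldOnDate = n()) %>%             # one row per ArticleNr with its row count
      ungroup() %>% arrange(desc(SoldOnDate)) %>% mutate(rank = row_number())  # sort descending, then number the rows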
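
To get from the full ranking to a top 3, top 10 or top 25, and to filter the original data down to those articles, something along these lines might work. It is only a sketch under a few assumptions: the date column is called soldDate as in the printed table, the cutoff variable top_n_articles is a made-up name, and tied articles are meant to share a rank (hence min_rank() instead of numbering the rows). It counts distinct dates with n_distinct(), which gives the same result as n() as long as there really is at most one row per ArticleNr and date.

```r
library(dplyr)

# Example data matching the table printed in the question
df <- data.frame(
  ArticleNr = c("11", "22", "33", "11", "22", "11"),
  soldDate  = as.Date(c("2020-01-01", "2020-01-01", "2020-01-01",
                        "2020-01-02", "2020-01-02", "2020-01-03")),
  soldUnits = c(1, 1, 1, 1, 1, 1),
  stringsAsFactors = FALSE
)

# Hypothetical cutoff: use 3, 10 or 25 on the real data
top_n_articles <- 2

ranking <- df %>%
  group_by(ArticleNr) %>%
  summarise(soldOnDates = n_distinct(soldDate)) %>%  # distinct dates per article
  mutate(Rank = min_rank(desc(soldOnDates))) %>%     # tied articles share a rank
  arrange(Rank, ArticleNr) %>%
  filter(Rank <= top_n_articles) %>%                 # keep only the top ranks
  select(Rank, ArticleNr, soldOnDates)

# Keep only the rows of the original data that belong to the top articles
df_top <- df %>% semi_join(ranking, by = "ArticleNr")
```

With top_n_articles set to 2, ranking holds articles 11 and 22, and df_top keeps their five rows. Because of the shared ranks, a cutoff of 10 can return more than 10 articles when several are tied at the boundary; if exactly n rows are needed, swapping min_rank() for row_number() would be one option.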