Search code examples
rsortingggplot2filteringggmap

Get the most repeated items in one column and its associated value in another column


My original dataset(cyclistic_data) has 3 columns: station name, station_lat and station_lng. This is the preview of my data set.
Preview of data set

station_coordinates <- 
   data.frame(
     station = c(Dearborn St, Franklin St, Lake Shore Dr, Dearborn St, 
                 Dearborn St,  Franklin St),
     station_lat = c(-87.6291273333333, -87.6353428333333, 
                     41.8809828333333, -87.6291273333333, 
                    -87.6291273333333, -87.6353428333333),
     station_lng = c(41.8944341666667, 41.900675, 41.8720545, 
                     41.8944341666667, 41.8944341666667, 41.900675) 
   )              

I tried

sort(table(cyclistic_data$start_station),decreasing=TRUE)[1:100]

code to get the most repeated 100 station names. By using this code I identified top 100 stations but I was unable to extract the corresponding lan and lng values from the original data det.I want retrieve the most repeated 100 stations name and its lat and lng values from the original data set.


Solution

  • It's a bit difficult to give an exact solution without a reproducible example, but if I understand you correctly, you could try:

    • Adding a column to your data frame with add_count(), we'll name it "station_n" here.
    • slice_max(n = 100, order_by = station_n) this takes the top 100 stations according to our count.
    data %%>
        add_count(station, name = "station_n")%>%
        slice_max(n = 100, order_by = station_n) 
    

    Plotting according to long and latitude is then a slightly different issue, and will be dependent on how you want to plot, but you can look at the maps() package and geom_sf() as starters

    After your comment, the simplest way I know to do this uses joins. There may well be a better/quicker option:

    #Making a minimally complex example: 
    data <- data.frame(station = c("one", "one", "two", "one", "three", "three"),
                         lat = c(18938, 18938, 18273, 18938, 199999, 199999))
    
    top <- data %>%
      count(station, sort = TRUE)%>%
      slice_max(n = 2, order_by = n)
    
    data %>%
      inner_join(top)
    
    

    And we could turn that into a function like so:

    keeping_top_n <- function(x, top_n = 100, var_of_interest = station){
      top <- x %>%
        count({{var_of_interest}}, sort = TRUE)%>%
        slice_max(n = top_n, order_by = n)
      
      x %>%
        inner_join(top)
    }
    
    #you would call this like:
    
    keeping_top_n(data_frame)