Search code examples
rdplyrplyr

R - ddply(): Using min value of one column to find the corresponding value in different column


I want to get a summary of min(cost) per country over the years with the specific airport. The dataset looks like this (around 1000 rows with multiple airports per country)

airport  country cost    year
ORD      US      500     2010
SFO      US      800     2010
LHR      UK      250     2010
CDG      FR      300     2010
FRA      GR      200     2010
ORD      US      650     2011
SFO      US      500     2011
LHR      UK      850     2011
CDG      FR      350     2011
FRA      GR      150     2011
ORD      US      250     2012
SFO      US      650     2012
LHR      UK      350     2012
CDG      FR      450     2012
FRA      GR      100     2012

The code below gets me summary of min(cost) per country

ddply(df,c('country'), summarize, LowestCost = min(cost))

When I try to display min(cost) of the country along with the specific airport, I just get one airport listed

ddply(df,c('country'), summarize, LowestCost = min(cost), AirportName = df[which.min(df[,3]),1])

The output should look like below

country  LowestCost  AirportName
US       250         ORD
UK       250         LHR
FR       300         CDG
GR       100         FRA

But instead it looks like this
country  LowestCost  AirportName
US       250         ORD
UK       250         ORD
FR       300         ORD
GR       100         ORD

Any help is appreciated


Solution

  • We may use slice_min from dplyr

    library(dplyr)
    df %>%
         select(-year) %>%
         group_by(country) %>%
         slice_min(cost, n = 1) %>%
         ungroup %>%
         rename(LowestCost = cost)
    

    -output

    # A tibble: 4 x 3
      airport country LowestCost
      <chr>   <chr>        <int>
    1 CDG     FR             300
    2 FRA     GR             100
    3 LHR     UK             250
    4 ORD     US             250
    

    In the plyr, code, the which.min is applied on the whole column, instead of the grouped column. We just need to specify the column name

    plyr::ddply(df, c("country"), plyr::summarise, 
       LowestCost = min(cost), AirportName = airport[which.min(cost)])
      country LowestCost AirportName
    1      FR        300         CDG
    2      GR        100         FRA
    3      UK        250         LHR
    4      US        250         ORD
    

    data

    df <- structure(list(airport = c("ORD", "SFO", "LHR", "CDG", "FRA", 
    "ORD", "SFO", "LHR", "CDG", "FRA", "ORD", "SFO", "LHR", "CDG", 
    "FRA"), country = c("US", "US", "UK", "FR", "GR", "US", "US", 
    "UK", "FR", "GR", "US", "US", "UK", "FR", "GR"), cost = c(500L, 
    800L, 250L, 300L, 200L, 650L, 500L, 850L, 350L, 150L, 250L, 650L, 
    350L, 450L, 100L), year = c(2010L, 2010L, 2010L, 2010L, 2010L, 
    2011L, 2011L, 2011L, 2011L, 2011L, 2012L, 2012L, 2012L, 2012L, 
    2012L)), class = "data.frame", row.names = c(NA, -15L))