Search code examples
r

Ignoring NA cases when getting column index of lowest value in row


I have an array of retail item prices, and am trying to create a new column that indicates which store has the lowest price.

  • The rows are the items, while the columns are the retailer names.

  • Not all of the retailers carry each product, so I need to ignore cells where the value is NA (instead of treating NA values as the lowest-cost item).

Setting everything up, I have

library(dplyr)

# Data

itemName <- c('Apples', 'Bananas', 'Milk', 'Eggs', 'Cereal')
Aldi <- c(NA, 0.55, 1.95, 157.00, 1.10)
Walmart <- c(0.75, 0.69, 2.15, 192.00, 1.10)
Costco <- c(NA, NA, 0.16, 75.00, NA)

Creating the df with a column that identifies the lowest price

retailPRICE <- data.frame(itemName, Aldi, Walmart, Costco)%>%
mutate(cheapestPRICE = pmin(Aldi, Walmart, Costco, na.rm = TRUE))

This is where I run into trouble

I'm trying to create a column that looks at all the prices, and pulls in the name of the store with the cheapest price.

retailPRICE$cheapestSTORE <- names(retailPRICE)
 [minCol(replace(retailPRICE, is.na(retailPRICE), 0), 
 ties.method = "first")]

What I want

| itemName | Aldi   | Walmart  |  Costco  | cheapestPRICE | cheapestSTORE |
| -------- | ------ | -------- | -------- | -------------   --------------
| Apples   | NA     | 0.75     | NA       | 0.75          | Walmart       |
| Bananas  | 0.55   | 0.69     | NA       | 0.55          | Aldi          |
| Milk     | 1.95   | 2.15     | 0.16     | 0.16          | Costco        |
| Eggs     | 157.00 | 192.00   | 75.00    | 75.00         | Costco        |
| Cereal   | 1.10   | 1.10     | NA       | 1.10          | Aldi          |

Problem the first

Attempting to create the cheapestSTORE column gives the following error: Error in FUN(left) : invalid argument to unary operator

  • Why? Is this because itemName is a character column?
  • Previous iterations of this code did not include the itemName column and I did not have this issue (don't worry, I had plenty of others).
  • I do need to keep the itemName to do eventually do a join, once I get this mess sorted out

Problem the second

Before I ran into the Error in FUN issue, I had a problem with how minCol was identifying the lowest value column. It was seeing NAs as the lowest value, so it would pull in the name of that store. I need to be able to exclude any NA entries from the assessment of which column is being indexed.

An example of what I was getting

| itemName | Aldi   | Walmart  |  Costco  | cheapestPRICE | cheapestSTORE |
| -------- | ------ | -------- | -------- | -------------   --------------
| Apples   | NA     | 0.75     | NA       | 0.75          | Aldi          |
| Bananas  | 0.55   | 0.69     | NA       | 0.55          | Costco        |
| Milk     | 1.95   | 2.15     | 0.16     | 0.16          | Costco        |
| Eggs     | 157.00 | 192.00   | 75.00    | 75.00         | Costco        |
| Cereal   | 1.10   | 1.10     | NA       | 1.10          | Costco        |

As you can see, the only row that is being evaluated correctly is Eggs. All of the other items are out of stock in at least one store, which is causing the cheapestSTORE column to identify wrong location as being the cheapest.

Can you please help me figure out what I should be doing when creating the cheapestSTORE column? Thank you!!


Solution

  • Because your data has prices in multiple columns, it is not tidy in the sense of Wickham 2014. I would suggest you reshape your data to long format first, so that each column represent a single variable: product, store and price.

    library(dplyr); library(tidyr)
    
    retailPRICE <- data.frame(itemName, Aldi, Walmart, Costco)
    
    retailPRICElong <- retailPRICE |> 
      pivot_longer(-itemName, names_to = 'Store', values_to = 'Price') |> 
      drop_na()
    

    We simply drop all the not-in-stock items by dropping NA values.

    Then these operations become much easier, and lead to code that is simple and easy to read:

    summarise(
      retailPRICElong,
      cheapestPrice = min(Price),
      cheapestStore = Store[which.min(Price)],   # in the case of ties, first one
      .by = itemName
    )
    
    # A tibble: 5 × 3
      itemName cheapestPrice cheapestStore
      <chr>            <dbl> <chr>        
    1 Apples            0.75 Walmart      
    2 Bananas           0.55 Aldi         
    3 Milk              0.16 Costco       
    4 Eggs             75    Costco       
    5 Cereal            1.1  Aldi
    

    For this case in particular, all you would need is:

    slice_min(retailPRICElong, Price, by = itemName)
    

    which also includes ties by default

    # A tibble: 6 × 3
      itemName Store   Price
      <chr>    <chr>   <dbl>
    1 Apples   Walmart  0.75
    2 Bananas  Aldi     0.55
    3 Milk     Costco   0.16
    4 Eggs     Costco  75   
    5 Cereal   Aldi     1.1 
    6 Cereal   Walmart  1.1