Search code examples
rdataframeduplicateshierarchical-data

Is there a more efficient way to handle facts which are duplicating in an R dataframe?


I have a dataframe which looks like this:

ID <- c(1,1,1,2,2,2,2,3,3,3,3)
Fact <- c(233,233,233,50,50,50,50,15,15,15,15)
Overall_Category <- c("Purchaser","Purchaser","Purchaser","Car","Car","Car","Car","Car","Car","Car","Car")
Descriptor <- c("Country", "Gender", "Eyes", "Color", "Financed", "Type", "Transmission", "Color", "Financed", "Type", "Transmission")
Members <- c("America", "Male", "Brown", "Red", "Yes", "Sedan", "Manual", "Blue","No", "Van", "Automatic")

df <- data.frame(ID, Fact, Overall_Category, Descriptor, Members)

The dataframes dimensions work like this:

  • There will always be an ID/key which singularly and uniquely identifies a submitted fact
  • There will always be a dimension for a given fact defining the Overall_Category of which a submitted fact belongs.
  • Most of the time - but not always - there will be a dimension for a "Descriptor",
  • If there is a "Descriptor" dimension for a given fact, there will be another "Members" dimension to show possible members within "Descriptor".

The problem is that a single submitted fact is duplicated for a given ID based on how many dimensions apply to the given fact. What I'd like is a way to show the fact only once, based on its ID, and have the applicable dimensions stored against that single ID.

I've achieved it by doing this:

df1 <- pivot_wider(df, 
id_cols = ID,
names_from = c(Overall_Category, Descriptor, Members),
names_prefix = "zzzz",
values_from = Fact,
names_sep = "-",
names_repair = "unique")

ColumnNames <- df1 %>% select(matches("zzzz")) %>% colnames()


df2 <- df1 %>% mutate(mean_sel = rowMeans(select(., ColumnNames), na.rm = T))
df3 <- df2 %>% mutate_at(ColumnNames, function(x) ifelse(!is.na(x), deparse(substitute(x)), NA))
df3 <- df3 %>% unite('Descriptor', ColumnNames, na.rm = T, sep = "_")
df3 <- df3 %>% mutate_at("Descriptor", str_replace_all, "zzzz", "")

But it seems like it wouldn't scale well for facts with many dimensions due to the pivot_wide, and in general doesn't seem like a very efficient approach.

Is there a better way to do this?


Solution

  • You can unite the columns and for each ID combine them together and take average of Fact values.

    library(dplyr)
    library(tidyr)
    
    df %>%
      unite(Descriptor, Overall_Category:Members, sep = '-', na.rm = TRUE) %>%
      group_by(ID) %>%
      summarise(Descriptor = paste0(Descriptor, collapse = '_'), 
                mean_sel = mean(Fact, na.rm = TRUE))
    
    #     ID Descriptor                                               mean_sel
    #  <dbl> <chr>                                                       <dbl>
    #1     1 Purchaser-Country-America_Purchaser-Gender-Male_Purchas…      233
    #2     2 Car-Color-Red_Car-Financed-Yes_Car-Type-Sedan_Car-Trans…       50
    #3     3 Car-Color-Blue_Car-Financed-No_Car-Type-Van_Car-Transmi…       15