Search code examples
rdataframereshapetidyrreshape2

How to Reshape DF with categorical variables from long to wide in R?


I am new to reshaping data frames. I have a df that I would like to make wider so I can use it in analysis such as cluster and NMDS. I have found several questions and (answers) related to how to reshape data containing primarily quantitative data (using aggregation functions), but in my case, my variables are all categorical.

As my df has a thousand rows and dozens of columns, I've created a toy df to use as an example. It looks like this:

df <- data.frame(
  id=c("a","c", "a","b","d","c","e","d","c","a","a","e","a","b","d"), 
  color=c("red", "blue", "gray", "yellow", "green","green","blue","purple"            ,"black","green","yellow","blue","red","yellow","gray"),
  fruit=c("apple", "orange", "avocado", "strawberry", "banana", "apple",               "orange", "avocado", "strawberry", "banana","banana", "strawberry",           "watermelon", "lemon", "lemon" ),
  country = c("Italy", "Spain", "Brazil", "Brazil", "Australia", "Italy",           "Japan", "India", "USA", "Mexico", "USA", "Mexico", "Spain",              "France", "France"),
  animal=c("alligator", "camel", "alligator", "bat", "dolphin", "camel",                "elephant", "dolphin", "camel", "alligator", "alligator",                    "elephant", "alligator", "bat", "dolphin")) 

I would like the column "id" to be the first in my reshaped data frame, "animal" to be the second, and then the levels of "color", "fruit", and "country". The point here is that I wanted them separated.

The code below shows some attempts I've made:

df <- dplyr::mutate_if(df,is.character,as.factor) 
attach(df)

dcast(df, id ~ color,value.var = "id") #The output is exactly what I wanted! 

dcast(df, id + animal ~ color,value.var = "id") #Exactly what I wanted!

dcast(df, id + animal ~ fruit,value.var = "id") #Exactly what I wanted!

dcast(df, id ~ country, value.var = "id") #Not the output I wanted. Only "works well" if I specify "fun.aggregate=length". Why?

dcast(df, id ~ color + country, value.var = "id") #Not the output what I wanted.

dcast(df, id + animal~ color + country, value.var = "id") #Not the output I wanted.

dcast(df, id + animal~ color + country + fruit, value.var = "id") #Not the output I wanted.

My expected reshaped df should look like this:

Expected reshape data frame

To achieve that, I've tried all commands below, but none of those worked well:

dcast(df, id + animal ~ color + country + fruit, fun.aggregate=length)

dcast(df, id + animal ~ c(color, country, fruit), fun.aggregate=length)

dcast(df, id + animal ~ c("color", "country", "fruit"), fun.aggregate=length)

dcast(df, id + animal ~ color:fruit, fun.aggregate=length)

I have also tried to do that using tidyr::pivot_wider, without success.

Is there a way to achieve my goal using reshape2::dcast or tidyr::pivot_wider, or any other function in R? I would appreciate it if you guys could help me. Thanks in advance.


Solution

  • First, you have to pivot_longer to get the column names you desire into a column. Then I arranged it by the future column names, so the words would be grouped, like your image, then I used pivot_wider. It drops the animal column, so I put it back, then arranged by id, so they would be in the same observation order as your image.

    pivot_longer(df, cols = color:country, names_to = "variable", 
                 values_to = "value") %>%                       # column names to rows
      arrange(variable, value) %>%                              # organize future column names
      pivot_wider(!variable, names_from = value, values_from = animal, 
                  values_fn = list(animal = length), values_fill = 0) %>%
      left_join(distinct(df[,c(1,5)])) %>%                      # add animals back
      select(id, animal, everything()) %>%                      # rearrange columns
      arrange(id)                                               # reorder observations
    

    enter image description here

    Update based on your comment - ordered by color, fruit, then country

    Added mutate and modified the first arrange and pivot_wider:

    pivot_longer(df,cols = color:country, names_to = "variable", 
                 values_to = "value") %>%                # future col names to rows
      mutate(ordering = ifelse(variable == "color", 1,   # create organizer variable
                               ifelse(variable == "fruit", 2, 3))) %>% 
      arrange(ordering, value) %>%                       # organize future column order
      pivot_wider(!c(variable,ordering),                 # make it wide
                  names_from = value, 
                  values_from = animal, 
                  values_fn = list(animal = length), 
                  values_fill = 0) %>%
      left_join(distinct(df[,c(1,5)])) %>%               # add the animals back
      select(id, animal, everything()) %>%               # move animals to 2nd position
      arrange(id)                                        # reorder observations 
    

    Check it out: enter image description here