Search code examples
riterationdata-cleaning

Tidying data frame with multiple variables within multiple columns


I'm working with a "molten" data frame, though perhaps "volcano" is a more apt description of it.

Basically, there are two columns in this data frame that break the first rule of tidy data (i.e., each variable is a column). The first column, variable_name, has seven different variables within it, and the second column, indicator_name, has 10 variables.

These variables are contained within a single column as factor levels and are different from each other (i.e., one could be age range, while another could be a particular region/state). It looks something like the following, but I have shortened it to simplify the example a bit (assume more variables under variable_name and indicator_name, many years besides 2010 in year, and sex includes both, male, and female).

year sex variable_name variable_category indicator_name indicator_category value
2010 both education no degree employment status employed 200
2010 both education no degree employment status unemployed 400
2010 both education one or more degrees employment status employed 400
2010 both education one or more degrees employment status unemployed 200
2010 both age group 25-54 employment status employed 350
2010 both age group 25-54 employment status unemployed 200
2010 both age group 55+ employment status employed 250
2010 both age group 55+ employment status unemployed 100
2010 both region east coast employment status employed 2000
2010 both region east coast employment status unemployed 500
2010 both region west coast employment status employed 3000
2010 both region west coast employment status unemployed 1000
2010 both education no degree job tenure less than 5 years 125
2010 both education no degree job tenure more than 5 years 75

Usually, cleaning long data with multiple variables within a column would require using pivot_wider(). However, I'm not sure this is the best option for this dataset. Since there are a lot of variables in these two columns, pivoting them out creates a ton of columns, which wouldn't really make it any easier to work with. In this case, and if I understand the rules of tidy data (please correct me if I'm wrong), I believe there should be 70 data frames (one table per observational unit).

What I've done instead is group_by() the variable_name and indicator_name columns and then use group_split() to create an object that's a list of 70 data frames. From there, I feel you can just create separate objects from each data frame in the list object and examine and/or join different tables of interest as need be.

require(dplyr)

data_list <-
data |>
group_by(variable_name, indicator_name) |>
group_split()

edu_empl_status <- data_list[[1]]

edu_empl_status
year sex variable_name variable_category indicator_name indicator_category value
2010 both education no degree employment status employed 200
2010 both education no degree employment status unemployed 400
2010 both education one or more degrees employment status employed 400
2010 both education one or more degrees employment status unemployed 200

However, creating 70 objects would be tedious without iterating. Is there a way to create multiple new objects by iterating over a single list object (maybe using purrr or the apply family of functions?). Or is this even the right way of going about fixing this data frame? Would it be better to keep it as a single dataframe and nest variables?

Anyone with experience working with these extremely molten/long data sets on the best way to break it up and clean it? Sorry if this question is unclear or if I've misunderstood anything.


Solution

  • I suspect you are on the right path. The first thing I would do will be to deal with those columns that have multiple variables in it. I have written an iterative function here that will create a new column for each unique entry in the "variable_name"/"indicator_name" columns.

    # Replicating your dataset    
    data <- data.frame(
        year = rep(2010, 14),
        sex = rep("both", 14),
        variable_name = c("education", "education", "education", "education", "age group", "age group", 
    "age group", "age group", "region", "region", "region", "region", "education", "education"),
        variable_category = c("no degree", "no degree", "one or more degrees", "one or more degrees", "25-54", "25-54", "55+", "55+", "east coast", "east coast", "west coast", "west coast", "no degree", "no degree"),
        indicator_name = c(rep("employment status", 12), rep("job tenure", 2)),
        indicator_category = c("employed", "unemployed", "employed", "unemployed", "employed", "unemployed", "employed", "unemployed", "employed", "unemployed", "employed", "unemployed", "less than 5 years", "more than 5 years"),
        value = c(200, 400, 400, 200, 350, 200, 250, 100, 2000, 500, 3000, 1000, 125, 75)
    )
    
    
    # Create an iterative function to generate new columns for unique values in variable_name and indicator_name
    create_columns <- function(data, name_col, category_col) {
      unique_names <- unique(data[[name_col]])
      for (name in unique_names) {
        new_col <- ifelse(data[[name_col]] == name, data[[category_col]], NA)
        col_name <- paste(name_col, name, sep = "_")
        data[[col_name]] <- new_col
      }
      return(data)
    }
    
    # Apply the function to both variable_name and indicator_name columns
    data <- create_columns(data, "variable_name", "variable_category")
    data <- create_columns(data, "indicator_name", "indicator_category")
    
    # Drop the original variable_name, variable_category, indicator_name, and indicator_category columns
    data <- data %>%
      select(-variable_name, -variable_category, -indicator_name, -indicator_category)
    

    print(data) enter image description here