Tidying data frame with multiple variables within multiple columns

I'm working with a "molten" data frame, though perhaps "volcano" is a more apt description of it.

Basically, there are two columns in this data frame that break the first rule of tidy data (i.e., each variable is a column). The first column, variable_name, has seven different variables within it, and the second column, indicator_name, has 10 variables.

These variables are contained within a single column as factor levels and are different from each other (i.e., one could be age range, while another could be a particular region/state). It looks something like the following, but I have shortened it to simplify the example a bit (assume more variables under variable_name and indicator_name, many years besides 2010 in year, and sex includes both, male, and female).

year	sex	variable_name	variable_category	indicator_name	indicator_category	value
2010	both	education	no degree	employment status	employed	200
2010	both	education	no degree	employment status	unemployed	400
2010	both	education	one or more degrees	employment status	employed	400
2010	both	education	one or more degrees	employment status	unemployed	200
2010	both	age group	25-54	employment status	employed	350
2010	both	age group	25-54	employment status	unemployed	200
2010	both	age group	55+	employment status	employed	250
2010	both	age group	55+	employment status	unemployed	100
2010	both	region	east coast	employment status	employed	2000
2010	both	region	east coast	employment status	unemployed	500
2010	both	region	west coast	employment status	employed	3000
2010	both	region	west coast	employment status	unemployed	1000
2010	both	education	no degree	job tenure	less than 5 years	125
2010	both	education	no degree	job tenure	more than 5 years	75

Usually, cleaning long data with multiple variables within a column would require using pivot_wider(). However, I'm not sure this is the best option for this dataset. Since there are a lot of variables in these two columns, pivoting them out creates a ton of columns, which wouldn't really make it any easier to work with. In this case, and if I understand the rules of tidy data (please correct me if I'm wrong), I believe there should be 70 data frames (one table per observational unit).

What I've done instead is group_by() the variable_name and indicator_name columns and then use group_split() to create an object that's a list of 70 data frames. From there, I feel you can just create separate objects from each data frame in the list object and examine and/or join different tables of interest as need be.

require(dplyr)

data_list <-
data |>
group_by(variable_name, indicator_name) |>
group_split()

edu_empl_status <- data_list[[1]]

edu_empl_status

year	sex	variable_name	variable_category	indicator_name	indicator_category	value
2010	both	education	no degree	employment status	employed	200
2010	both	education	no degree	employment status	unemployed	400
2010	both	education	one or more degrees	employment status	employed	400
2010	both	education	one or more degrees	employment status	unemployed	200

However, creating 70 objects would be tedious without iterating. Is there a way to create multiple new objects by iterating over a single list object (maybe using purrr or the apply family of functions?). Or is this even the right way of going about fixing this data frame? Would it be better to keep it as a single dataframe and nest variables?

Anyone with experience working with these extremely molten/long data sets on the best way to break it up and clean it? Sorry if this question is unclear or if I've misunderstood anything.

Solution

I suspect you are on the right path. The first thing I would do will be to deal with those columns that have multiple variables in it. I have written an iterative function here that will create a new column for each unique entry in the "variable_name"/"indicator_name" columns.

# Replicating your dataset    
data <- data.frame(
    year = rep(2010, 14),
    sex = rep("both", 14),
    variable_name = c("education", "education", "education", "education", "age group", "age group", 
"age group", "age group", "region", "region", "region", "region", "education", "education"),
    variable_category = c("no degree", "no degree", "one or more degrees", "one or more degrees", "25-54", "25-54", "55+", "55+", "east coast", "east coast", "west coast", "west coast", "no degree", "no degree"),
    indicator_name = c(rep("employment status", 12), rep("job tenure", 2)),
    indicator_category = c("employed", "unemployed", "employed", "unemployed", "employed", "unemployed", "employed", "unemployed", "employed", "unemployed", "employed", "unemployed", "less than 5 years", "more than 5 years"),
    value = c(200, 400, 400, 200, 350, 200, 250, 100, 2000, 500, 3000, 1000, 125, 75)
)


# Create an iterative function to generate new columns for unique values in variable_name and indicator_name
create_columns <- function(data, name_col, category_col) {
  unique_names <- unique(data[[name_col]])
  for (name in unique_names) {
    new_col <- ifelse(data[[name_col]] == name, data[[category_col]], NA)
    col_name <- paste(name_col, name, sep = "_")
    data[[col_name]] <- new_col
  }
  return(data)
}

# Apply the function to both variable_name and indicator_name columns
data <- create_columns(data, "variable_name", "variable_category")
data <- create_columns(data, "indicator_name", "indicator_category")

# Drop the original variable_name, variable_category, indicator_name, and indicator_category columns
data <- data %>%
  select(-variable_name, -variable_category, -indicator_name, -indicator_category)

print(data)