I'm working with a "molten" data frame, though perhaps "volcano" is a more apt description of it.
Basically, there are two columns in this data frame that break the first rule of tidy data (i.e., each variable is a column). The first column, variable_name
, has seven different variables within it, and the second column, indicator_name
, has 10 variables.
These variables are contained within a single column as factor levels and are different from each other (i.e., one could be age range, while another could be a particular region/state). It looks something like the following, but I have shortened it to simplify the example a bit (assume more variables under variable_name
and indicator_name
, many years besides 2010 in year
, and sex
includes both, male, and female).
year | sex | variable_name | variable_category | indicator_name | indicator_category | value |
---|---|---|---|---|---|---|
2010 | both | education | no degree | employment status | employed | 200 |
2010 | both | education | no degree | employment status | unemployed | 400 |
2010 | both | education | one or more degrees | employment status | employed | 400 |
2010 | both | education | one or more degrees | employment status | unemployed | 200 |
2010 | both | age group | 25-54 | employment status | employed | 350 |
2010 | both | age group | 25-54 | employment status | unemployed | 200 |
2010 | both | age group | 55+ | employment status | employed | 250 |
2010 | both | age group | 55+ | employment status | unemployed | 100 |
2010 | both | region | east coast | employment status | employed | 2000 |
2010 | both | region | east coast | employment status | unemployed | 500 |
2010 | both | region | west coast | employment status | employed | 3000 |
2010 | both | region | west coast | employment status | unemployed | 1000 |
2010 | both | education | no degree | job tenure | less than 5 years | 125 |
2010 | both | education | no degree | job tenure | more than 5 years | 75 |
Usually, cleaning long data with multiple variables within a column would require using pivot_wider()
. However, I'm not sure this is the best option for this dataset. Since there are a lot of variables in these two columns, pivoting them out creates a ton of columns, which wouldn't really make it any easier to work with. In this case, and if I understand the rules of tidy data (please correct me if I'm wrong), I believe there should be 70 data frames (one table per observational unit).
What I've done instead is group_by()
the variable_name
and indicator_name
columns and then use group_split()
to create an object that's a list of 70 data frames. From there, I feel you can just create separate objects from each data frame in the list object and examine and/or join different tables of interest as need be.
require(dplyr)
data_list <-
data |>
group_by(variable_name, indicator_name) |>
group_split()
edu_empl_status <- data_list[[1]]
edu_empl_status
year | sex | variable_name | variable_category | indicator_name | indicator_category | value |
---|---|---|---|---|---|---|
2010 | both | education | no degree | employment status | employed | 200 |
2010 | both | education | no degree | employment status | unemployed | 400 |
2010 | both | education | one or more degrees | employment status | employed | 400 |
2010 | both | education | one or more degrees | employment status | unemployed | 200 |
However, creating 70 objects would be tedious without iterating. Is there a way to create multiple new objects by iterating over a single list object (maybe using purrr
or the apply
family of functions?). Or is this even the right way of going about fixing this data frame? Would it be better to keep it as a single dataframe and nest variables?
Anyone with experience working with these extremely molten/long data sets on the best way to break it up and clean it? Sorry if this question is unclear or if I've misunderstood anything.
I suspect you are on the right path. The first thing I would do will be to deal with those columns that have multiple variables in it. I have written an iterative function here that will create a new column for each unique entry in the "variable_name"/"indicator_name" columns.
# Replicating your dataset
data <- data.frame(
year = rep(2010, 14),
sex = rep("both", 14),
variable_name = c("education", "education", "education", "education", "age group", "age group",
"age group", "age group", "region", "region", "region", "region", "education", "education"),
variable_category = c("no degree", "no degree", "one or more degrees", "one or more degrees", "25-54", "25-54", "55+", "55+", "east coast", "east coast", "west coast", "west coast", "no degree", "no degree"),
indicator_name = c(rep("employment status", 12), rep("job tenure", 2)),
indicator_category = c("employed", "unemployed", "employed", "unemployed", "employed", "unemployed", "employed", "unemployed", "employed", "unemployed", "employed", "unemployed", "less than 5 years", "more than 5 years"),
value = c(200, 400, 400, 200, 350, 200, 250, 100, 2000, 500, 3000, 1000, 125, 75)
)
# Create an iterative function to generate new columns for unique values in variable_name and indicator_name
create_columns <- function(data, name_col, category_col) {
unique_names <- unique(data[[name_col]])
for (name in unique_names) {
new_col <- ifelse(data[[name_col]] == name, data[[category_col]], NA)
col_name <- paste(name_col, name, sep = "_")
data[[col_name]] <- new_col
}
return(data)
}
# Apply the function to both variable_name and indicator_name columns
data <- create_columns(data, "variable_name", "variable_category")
data <- create_columns(data, "indicator_name", "indicator_category")
# Drop the original variable_name, variable_category, indicator_name, and indicator_category columns
data <- data %>%
select(-variable_name, -variable_category, -indicator_name, -indicator_category)