Tags: r, dataformat, factor-analysis, long-format-data, wide-format-data

Best way to format this data for exploratory factor analysis, using R?


I have very little experience with factor analysis, so my question may be rudimentary. It is about data formatting. I have this dataset (this is just the first two of several hundred rows):

df <- structure(list(X.Case.ID. = 310:311, GENDER = c(1L, 1L), AGE = c(32L, 
45L), EMPLOYMENT = c(1L, 1L), EDUCATION = c(6L, 6L), FUNCTION = c(6L, 
6L), A = c(1L, 1L), SECTOR = 1:2, EMPLOYEES = c(567L, 500L), 
    STATE = c(35L, 10L), REGION = 3:2, REVENUE = 9:8, Q1 = c(2L, 
    2L), Q2 = 2:1, Q3 = 1:2, Q4 = c(1L, 3L), Q5 = 3:2, Q6_C1 = c(0L, 
    0L), Q6_C2 = 0:1, Q6_C3 = c(0L, 0L), Q6_C4 = 1:0, Q6_C5 = 0:1, 
    Q6_C6 = c(0L, 0L), Q6_C7 = c(0L, 0L), Q6_C8 = c(0L, 0L), 
    Q6_C9 = c(0L, 0L), O_Q6_C9 = c(NA, NA), Q7_C1 = 0:1, Q7_C2 = c(0L, 
    0L), Q7_C3 = c(0L, 0L), Q7_C4 = c(0L, 0L), Q7_C5 = 1:0, Q7_C6 = 0:1, 
    Q7_C7 = c(0L, 0L), Q7_C8 = c(0L, 0L), Q7_C9 = c(0L, 0L), 
    Q7_C10 = c(0L, 0L), O_Q7_C9 = c(NA, NA), Q8 = 4:3, Q9_C1 = c(0L, 
    0L), Q9_C2 = 0:1, Q9_C3 = c(0L, 0L), Q9_C4 = c(0L, 0L), Q9_C5 = c(1L, 
    1L), Q9_C6 = 0:1, Q9_C7 = c(0L, 0L), Q9_C8 = c(0L, 0L), Q9_C9 = c(0L, 
    0L), Q10 = c(3L, 1L), Q11_C1 = c(0L, 0L), Q11_C2 = 1:0, Q11_C3 = 0:1, 
    Q11_C4 = 0:1, Q11_C5 = 1:0, Q11_C6 = c(0L, 0L), Q11_C7 = c(0L, 
    0L), Q12 = c(34L, 15L), Q13 = c("99,994", "700"), Q14 = 1:2, 
    Q15 = 1:2, Q16_C1 = c(0L, 0L), Q16_C2 = c(0L, 0L), Q16_C3 = c(0L, 
    0L), Q16_C4 = c(0L, 0L), Q16_C5 = c(0L, 0L), Q16_C6 = c(0L, 
    0L), Q16_C7 = c(0L, 0L), Q16_C8 = c(0L, 0L), Q16_C9 = c(0L, 
    0L), Q16_C10 = c(0L, 0L), Q16_C11 = c(0L, 0L), O_Q16_C11 = c(NA, 
    NA), Q17_C1 = c(0L, 0L), Q17_C2 = c(0L, 0L), Q17_C3 = c(0L, 
    0L), Q17_C4 = c(0L, 0L), Q17_C5 = 1:0, Q17_C6 = 0:1, Q17_C7 = c(0L, 
    0L), Q17_C8 = 0:1, Q17_C9 = 0:1, Q17_C10 = 1:0, Q17_C11 = c(0L, 
    0L), Q17_C12 = c(0L, 0L), O_Q17_C11 = c(NA, NA), Q18 = c(5L, 
    3L), Q19_C1 = 0:1, Q19_C2 = 1:0, Q19_C3 = c(0L, 0L), Q19_C4 = 0:1, 
    Q19_C5 = c(0L, 0L), Q19_C6 = c(0L, 0L), Q19_C7 = c(0L, 0L
    ), Q19_C8 = c(0L, 0L), O_Q19_C7 = c(NA, NA), Q20_M1 = c(7L, 
    1L), Q20_M2 = 2:3, Q20_M3 = 3:4, Q20_M4 = 6:5, Q20_M5 = c(5L, 
    2L), Q20_M6 = c(4L, 7L), Q20_M7 = c(1L, 6L), YEARSATCOMPANY = c(7L, 
    10L), YEARSINBUS = c(12L, 10L), OFFICES = 3:2, SEGMENTS = c("Enterprise", 
    "Commercial"), DEDICATED_STAFF = c(">=20", "<20"), HOURS = c(">=10000", 
    "<1000"), VULNERABLE = c("Vulnerable", "Not Vulnerable"), 
    Hours.Adjusted = c("10000", "700"), Commerical.or.Enterprise = c("Enterprise", 
    "Commercial"), Q13.num = c(99994, 700), Segments = c(2, 1
    ), Dedicated.Staff = c(2, 1), Vulnerable = c(2, 1), Hours = c(4, 
    1)), row.names = 1:2, class = "data.frame")

I realize the column names and values aren't self-explanatory. Some questions, like the demographics, have a single response per person.

There are questions like Q1: [image: Q1]

And then there are the questions like Q6, which has 9 parts to it (O_Q6_C9 and others like it are empty). Here are the first 3 parts, for brevity.

[image: Q6]

Questions like Q6 are effectively "long" questions stored in wide format, so the data mixes one-to-one and one-to-many responses. Given that, what's the best way to format this data for an exploratory factor analysis? Should I leave it in wide format, or put it in long format, which would duplicate the one-to-one variables? I know I still have other data cleaning to do, such as converting character data to numeric.

Thanks in advance!


Solution

  • This borders on a statistics question rather than a programming question, so my answer will be brief.

    The point of putting data in any particular layout is to help you achieve your goal. In this case, your goal is to carry out your factor analysis. Therefore, whatever we decide must make sense in the context of that objective.

    However, several details are missing: what features do you actually plan to use in the factor analysis? How are you going to encode these various responses? Do you need to do any preliminary data cleaning? Do you want to restrict your focus to certain questions, or use all of them? And are there any complicating statistical factors involved, like repeated measures?
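
    To make the encoding and cleaning questions concrete, here is a minimal sketch on a toy subset of the columns from the question. The names `Q13_num`, `all_na`, and `df_clean` are only illustrative, and whether these are the right cleaning steps for your analysis is an assumption you would need to check.

    ```r
    # Toy subset of the question's data (two rows, a handful of columns).
    df <- data.frame(
      X.Case.ID. = 310:311,
      Q1 = c(2L, 2L),
      Q6_C2 = 0:1,
      O_Q6_C9 = c(NA, NA),
      Q13 = c("99,994", "700"),
      stringsAsFactors = FALSE
    )

    # Strip the thousands separator, then convert the character column to numeric.
    # (Illustrative name; the question's data already has a similar Q13.num column.)
    df$Q13_num <- as.numeric(gsub(",", "", df$Q13, fixed = TRUE))

    # Flag columns that are entirely NA, such as the empty "O_" columns, and drop them.
    all_na <- vapply(df, function(col) all(is.na(col)), logical(1))
    df_clean <- df[, !all_na, drop = FALSE]

    names(df_clean)
    ```

    Steps like these only handle the mechanics. Deciding which of the remaining columns belong in the analysis is still a judgment call.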

    The best data layout is the one that makes your factor analysis easy. How do you decide what that's going to be? You're going to need a combination of domain knowledge and exploratory data analysis to figure that out.
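
    As one concrete reference point, base R's `factanal()` takes a numeric matrix or data frame with one row per observation and one column per variable. The sketch below uses simulated data, not your survey, and the item names and loadings are made up purely for illustration. It only shows what shape of input that function accepts, assuming that is the function you end up using.

    ```r
    set.seed(1)  # simulated data only, so the sketch is reproducible
    n <- 200

    # Two hypothetical latent factors driving six hypothetical items.
    f1 <- rnorm(n)
    f2 <- rnorm(n)
    noise <- function() rnorm(n, sd = 0.6)

    # Wide layout: one row per respondent, one column per item.
    items <- data.frame(
      item1 = 0.8 * f1 + noise(),
      item2 = 0.8 * f1 + noise(),
      item3 = 0.8 * f1 + noise(),
      item4 = 0.8 * f2 + noise(),
      item5 = 0.8 * f2 + noise(),
      item6 = 0.8 * f2 + noise()
    )

    # Maximum-likelihood factor analysis with two factors (varimax rotation by default).
    fit <- factanal(items, factors = 2)
    print(fit$loadings, cutoff = 0.3)
    ```

    Whether your own columns, many of which are 0/1 checkbox indicators, are suitable inputs for this kind of model is a statistical question that this sketch does not settle.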

    So I would argue that your goal at this point in the project should not even involve doing the factor analysis yet. It should be understanding the data that you actually have, so that you can answer the questions that you need to answer, in order to actually carry out the factor analysis.
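
    A first pass at understanding the data could look something like the sketch below. It uses a toy subset of the question's columns, and the `overview` table and `constant_cols` vector are illustrative names rather than an established recipe.

    ```r
    # Toy subset of the question's data (two rows, a handful of columns).
    df <- data.frame(
      AGE = c(32L, 45L),
      Q1 = c(2L, 2L),
      Q6_C1 = c(0L, 0L),
      Q6_C2 = 0:1,
      O_Q6_C9 = c(NA, NA),
      SEGMENTS = c("Enterprise", "Commercial"),
      stringsAsFactors = FALSE
    )

    # Compact look at column types and the first few values.
    str(df)

    # Per-column summary: storage class, missing values, distinct non-missing values.
    col_class <- vapply(df, function(col) class(col)[1], character(1))
    n_missing <- colSums(is.na(df))
    n_distinct <- vapply(df, function(col) length(unique(col[!is.na(col)])), integer(1))

    overview <- data.frame(
      class = col_class,
      n_missing = n_missing,
      n_distinct = n_distinct,
      row.names = names(df)
    )

    # Columns with at most one distinct value carry no variance.
    constant_cols <- rownames(overview)[overview$n_distinct <= 1]
    constant_cols
    ```

    With only two rows, several columns show up as constant here. On the full dataset, the same check would highlight columns whose correlations are undefined and which therefore cannot enter a factor analysis as they stand.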

    Beyond that, I think this question might be too broad and too off-topic to answer further here. I suggest getting started with some basic EDA, and then asking targeted follow-up questions as needed on the statistics Q&A site: https://stats.stackexchange.com