Search code examples
rxlsx

R Beginner struggling with extremely messy XLSX


I got an XLSX with data from a questionnaire for my master thesis. The questions and answers for an interviewee are in one row in the second column. The first column contains the date.

The data of the second column comes in a form like this:

"age":"52","height":"170","Gender":"Female",...and so on

I started with:

test12 <- read_xlsx("Testdaten.xlsx")

library(splitstackshape) test13 <- concat.split(data = test12, split.col= "age", sep =",")

Then I got the questions and the answers as a column divided by a ":". For e.g. column 1: "age":"52" and column2:"height":"170". But the data is so messy that sometimes in the column of the age question and answer there is a height question and answer and for some questionnaires questions and answers double.

I would need the questions as variables and the answers as observations. But I have no clue how to get there. I could clean the data in excel first, but with the fact that columns are not constant and there are for e.g. some height questions in the age column I see no chance to do it as I will get new data regularly, formated the same way.

Here is an example of the data: A tibble: 5 x 2 partner.createdAt partner.wphg.info
<chr> <chr>
1 2019-11-09T12:13:11.099Z "{\"age_years\":\"50\",\"job_des\":\"unemployed\",\"height_cm\":\"170\",\"Gender\":\"female\",\"born_in\":\"Italy\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"5\",\"total_wealth\":\"200000\""
2 2019-11-01T06:43:22.581Z "{\"age_years\":\"34\",\"job_des\":\"self-employed\",\"height_cm\":\"158\",\"Gender\":\"male\",\"born_in\":\"Germany\",\"Alcoholic\":\"true\",\"knowledge_selfass\":\"3\",\"total_wealth\":\"10000\""
3 2019-11-10T07:59:46.136Z "{\"age_years\":\"24\",\"height_cm\":\"187\",\"Gender\":\"male\",\"born_in\":\"England\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"3\",\"total_wealth\":\"150000\""
4 2019-11-11T13:01:48.488Z "{\"age_years\":\"59\",\"job_des\":\"employed\",\"height_cm\":\"167\",\"Gender\":\"female\",\"born_in\":\"United States\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"2\",\"total_wealth\":\"1000000~ 5 2019-11-08T14:54:26.654Z "{\"age_years\":\"36\",\"height_cm\":\"180\",\"born_in\":\"Germany\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"5\",\"total_wealth\":\"170000\",\"job_des\":\"employed\",\"Gender\":\"male\""

Thank you so much for your time!


Solution

  • You can loop through each entry, splitting at , as you did. Then you can loop through them all again, splitting at :.

    The result will be a bunch of variable/value pairings. This can be all done stacked. Then you just want to pivot back into columns.

    data

    Updated the data based on your edit.

    data <- tribble(~partner.createdAt,        ~partner.wphg.info,
                    '2019-11-09T12:13:11.099Z', '{\"age_years\":\"50\",\"job_des\":\"unemployed\",\"height_cm\":\"170\",\"Gender\":\"female\",\"born_in\":\"Italy\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"5\",\"total_wealth\":\"200000\"',
                    '2019-11-01T06:43:22.581Z', '{\"age_years\":\"34\",\"job_des\":\"self-employed\",\"height_cm\":\"158\",\"Gender\":\"male\",\"born_in\":\"Germany\",\"Alcoholic\":\"true\",\"knowledge_selfass\":\"3\",\"total_wealth\":\"10000\"',
                    '2019-11-10T07:59:46.136Z', '{\"age_years\":\"24\",\"height_cm\":\"187\",\"Gender\":\"male\",\"born_in\":\"England\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"3\",\"total_wealth\":\"150000\"',
                    '2019-11-11T13:01:48.488Z', '{\"age_years\":\"59\",\"job_des\":\"employed\",\"height_cm\":\"167\",\"Gender\":\"female\",\"born_in\":\"United States\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"2\",\"total_wealth\":\"1000000\"',
                    '2019-11-08T14:54:26.654Z', '{\"age_years\":\"36\",\"height_cm\":\"180\",\"born_in\":\"Germany\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"5\",\"total_wealth\":\"170000\",\"job_des\":\"employed\",\"Gender\":\"male\"')
    

    libraries

    We need a few here. Or you can just call tidyverse.

    library(stringr)
    library(purrr)
    library(dplyr)
    library(tibble)
    library(tidyr)
    

    function

    This function will create a data frame (or tibble) for each question. The first column is the date, the second is the variable, the third is the value.

    clean_record <- function(date, text) {
    
      clean_records <- str_split(text, pattern = ",", simplify = TRUE) %>%
        str_remove_all(pattern = "\\\"") %>%         # remove double quote
        str_remove_all(pattern = "\\{|\\}") %>%      # remove curly brackets
        str_split(pattern = ":", simplify = TRUE)
    
      tibble(date = as.Date(date), variable = clean_records[,1], value = clean_records[,2])
    
    }
    

    iteration

    Now we use pmap_dfr from purrr to loop over the rows, outputting each row with an id variable named record.

    This will stack the data as described in the function. The mutate() line converts all variable names to lowercase. The distinct() line will filter out rows that are exact duplicates.

    What we do then is just pivot on the variable column. Of course, replace data with whatever you name your data frame.

    data_clean <- pmap_dfr(data, ~ clean_record(..1, ..2), .id = "record") %>%
      mutate(variable = tolower(variable)) %>%
      distinct() %>%
      pivot_wider(names_from = variable, values_from = value)
    

    result

    The result is something like this. Note how I had reordered some of the columns, but it still works. You are probably not done just yet. All columns are now of type character. You need to figure out the desired type for each and convert.

    # A tibble: 5 x 10
      record date       age_years job_des       height_cm gender born_in       alcoholic knowledge_selfass total_wealth
      <chr>  <date>     <chr>     <chr>         <chr>     <chr>  <chr>         <chr>     <chr>             <chr>       
    1 1      2019-11-09 50        unemployed    170       female Italy         false     5                 200000      
    2 2      2019-11-01 34        self-employed 158       male   Germany       true      3                 10000       
    3 3      2019-11-10 24        NA            187       male   England       false     3                 150000      
    4 4      2019-11-11 59        employed      167       female United States false     2                 1000000     
    5 5      2019-11-08 36        employed      180       male   Germany       false     5                 170000
    

    For example, convert age_years to numeric.

    data_clean %>%
      mutate(age_years = as.numeric(age_years))
    

    I am sure you may run into other things, but this should be a start.