Tags: r, list, dataframe, apriori

How to convert a dataframe in long format into a list of an appropriate format?


I have a dataframe in the following long format:

[Image 1: the data frame in long format]

I need to convert it into a list that looks something like this:

[Image 2: the desired list]

Each main element of the list should be one "Instance No.", and its sub-elements should contain all of its corresponding Parameter and Value pairs, in the format "Parameter X" = "abc" as you can see in the second picture, listed one after the other.

Is there an existing function that can do this? I wasn't able to find one. Any help would be much appreciated.

Thank you.


Solution

  • A dplyr solution

    library(dplyr)
    df_original <- data.frame("Instance No." = c(3,3,3,3,5,5,5,2,2,2,2),
                          "Parameter" = c("age", "workclass", "education", "occupation", 
                                          "age", "workclass", "education", 
                                          "age", "workclass", "education", "income"),
                          "Value" = c("Senior", "Private", "HS-grad", "Sales",
                                      "Middle-aged", "Gov", "Hs-grad",
                                      "Middle-aged", "Private", "Masters", "Large"),
                          check.names = FALSE)
        
    # Build Param_Value, the properly formatted "Parameter=Value" string for each row.
    # (split(), used further down, groups by Instance No. and coerces it to a factor itself.)
    df_modified <- mutate(df_original,
                          Param_Value = paste0(Parameter, "=", Value))
    # drop the parameter and value columns now that the data is contained in Param_Value
    df_modified <- select(df_modified,
                          `Instance No.`,
                          Param_Value)
    
    # split() returns a list of dataframes, with rows grouped by Instance No.
    list_format <- split(df_modified, 
                         df_modified$`Instance No.`)
    
    # The Instance No. is still in each dataframe. Loop through each and strip the column.
    list_simplified <- lapply(list_format, 
                              select, -`Instance No.`)
    
    # unlist the remaining Param_Value column and drop the names.
    list_out <- lapply(list_simplified,
                       unlist, use.names = FALSE)
                         
    

    There should now be a list of vectors formatted as requested.

    $`2`
    [1] "age=Middle-aged"   "workclass=Private" "education=Masters" "income=Large"     
    
    $`3`
    [1] "age=Senior"        "workclass=Private" "education=HS-grad" "occupation=Sales" 
    
    $`5`
    [1] "age=Middle-aged"   "workclass=Gov"     "education=Hs-grad"
    

    The posted data.table solution is faster, but I think this is a bit more understandable.
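For comparison, the same grouping can likely be done in base R alone, since split() also accepts a plain vector. The sketch below is only an illustration: it rebuilds the sample data from this answer under the assumed name df, and the second part (named_list) is a hypothetical variant that stores each parameter as a named sub-element rather than as a "Parameter=Value" string, in case that is closer to what the question's second picture shows.

```r
# Sample data copied from the answer above; `df` is an illustrative name.
df <- data.frame("Instance No." = c(3, 3, 3, 3, 5, 5, 5, 2, 2, 2, 2),
                 "Parameter" = c("age", "workclass", "education", "occupation",
                                 "age", "workclass", "education",
                                 "age", "workclass", "education", "income"),
                 "Value" = c("Senior", "Private", "HS-grad", "Sales",
                             "Middle-aged", "Gov", "Hs-grad",
                             "Middle-aged", "Private", "Masters", "Large"),
                 check.names = FALSE,
                 stringsAsFactors = FALSE)

# Build the "Parameter=Value" strings and split them by Instance No. in one step.
# The list elements are named after the sorted instance numbers.
list_out <- split(paste0(df$Parameter, "=", df$Value),
                  df$`Instance No.`)

# Hypothetical variant: one named list per instance, so that a value can be
# looked up by parameter name, e.g. named_list[["3"]]$occupation.
named_list <- lapply(split(df, df$`Instance No.`),
                     function(g) setNames(as.list(g$Value), g$Parameter))

print(list_out)
```

The first form should match the list printed above; the second trades the flat strings for name-based access, which may or may not be what a downstream function expects.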