
Spreading columns in R runs out of memory


I have survey data and I need to collapse each respondent's answers into a single row, but I am having problems with spread and group_by.

My dataset (data) has the following format:

country date_   user_id int_id  user_name   ext_name    q_order questions   answers
AR  2019    AR-100  XP200   jhon foo    damian, khon    1   Question1 … yes
AR  2019    AR-100  XP200   jhon foo    damian, khon    2   Question2 … 0
AR  2019    AR-100  XP200   jhon foo    damian, khon    3   Question3 … no apply
AR  2019    AR-100  XP200   jhon foo    damian, khon    4   Question4 … 0
AR  2019    AR-100  XP200   jhon foo    damian, khon    5   Question5 … 0
AR  2019    AR-100  XP200   jhon foo    damian, khon    6   Question6 … yes
US  2018    US-100  PP300   Peter fields    jhon voigh  1   Question1 … no
US  2018    US-100  PP300   Peter fields    jhon voigh  2   Question2 … 0
US  2018    US-100  PP300   Peter fields    jhon voigh  3   Question3 … yes apply
US  2018    US-100  PP300   Peter fields    jhon voigh  4   Question4 … 0
US  2018    US-100  PP300   Peter fields    jhon voigh  5   Question5 … 0
US  2018    US-100  PP300   Peter fields    jhon voigh  6   Question6 … no

I tried to group the resulting dataset, but I always get 14 rows instead of 2.

Code:

data %>% 
    group_by(country=.$country  ,
             date_ = .$date_,
             medic_id=.$user_id,
             user_id= .$int_id,
             user_name= .$user_name,
             ext_name= .$ext_name,
             q_order=.$q_order
             ) %>% 
    spread(questions, answers) 

The code above gives me an out-of-memory error.

I also tried dcast:

data %>% 
    select(-q_order) %>% 
    dcast( ...  ~ questions, value.var = "answers")

And I get the following:

Country.Code    Created.Date    user_id int_id  user_name   ext_name    Question1 … Question2 … Question3 … Question4 … Question5 … Question6 …
AR  3/28/2019   AR-100  XP200   jhon foo    damian, khon    1   2   0   1   1   1
US  4/28/2019   US-100  PP300   Peter fields    jhon voigh  0   1   1   2   1   2

But I need:

Country.Code    Created.Date    user_id int_id  user_name   ext_name    Question1 … Question2 … Question3 … Question4 … Question5 … Question6 …
AR  3/28/2019   AR-100  XP200   jhon foo    damian, khon    yes 0   no apply    0   0   yes
US  4/28/2019   US-100  PP300   Peter fields    jhon voigh  no  0   yes apply   0   0   no

Why does dcast convert all the values of the answers variable to numbers? (I even tried with var.values='answers'.)

My question is very similar to this link!

But I can't make it work: it always either runs out of memory or produces numerical values instead of the values from the answers variable.


Solution

  • I finally found the answer!

    The problem was (I'm a newbie in R) that I wanted to spread the values of one column across the columns of a single row, but these values are characters, and most solutions handle numeric values rather than characters.

    On the other hand, my solution with reshape() works great on a 5-row example, but on a real (small to medium) dataset it runs out of memory and never finishes.

    For example, the following code never finishes (and yes, I tried with grouping too, as I said):

    b<-reshape(data=a %>% select(-q_order) ,
               direction="wide",
               idvar = c("Country.Code","Created.Date", "user_id", "int_id", "user_name",
                         "ext_name"),
               timevar="questions" )
    

    This solution runs in 2 seconds:

    b <- dcast(a, Country.Code + Created.Date + user_id + int_id + user_name + ext_name ~ questions,
               fun.aggregate = toString, value.var = "answers")
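
    A likely reason the earlier dcast call returned numbers: when several rows share the same identifier columns and the same question, dcast has to aggregate them, and without an aggregation function it defaults to length(), which yields counts. The sketch below assumes reshape2's dcast and uses a made-up data frame `toy` (not the survey data) to show the difference that toString makes.

    ```r
    library(reshape2)

    # Hypothetical toy data: one respondent who answered Question1 twice
    toy <- data.frame(
      user_id   = c("AR-100", "AR-100", "AR-100"),
      questions = c("Question1", "Question1", "Question2"),
      answers   = c("yes", "no", "0"),
      stringsAsFactors = FALSE
    )

    # No aggregation function given: dcast falls back to length(),
    # so each cell holds the number of matching rows, not the answer text
    counts <- dcast(toy, user_id ~ questions, value.var = "answers")

    # toString keeps the text and joins duplicated answers into one string
    texts <- dcast(toy, user_id ~ questions,
                   fun.aggregate = toString, value.var = "answers")
    ```

    Here `counts` contains integers while `texts` contains character values, which matches the two behaviours described in the question.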
    

    Final result:

    Country.Code    Created.Date    user_id int_id  user_name   ext_name    Question1 … Question2 … Question3 … Question4 … Question5 … Question6 …
    AR  3/28/2019   AR-100  XP200   jhon foo    damian, khon    yes 0   no apply    0   0   yes
    US  4/28/2019   US-100  PP300   Peter fields    jhon voigh  no  0   yes apply   0   0   no
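
    For comparison, here is a minimal sketch of the same reshaping with tidyr's pivot_wider. This is not the method used above; it assumes tidyr 1.1.0 or later (where values_fn accepts a single function), and `survey` is a hypothetical, cut-down version of the data in the question. It also suggests why keeping q_order among the grouping columns left one row per question: q_order differs on every row, so it has to be dropped before widening.

    ```r
    library(dplyr)
    library(tidyr)

    # Hypothetical cut-down survey: two respondents, two questions each
    survey <- data.frame(
      country   = c("AR", "AR", "US", "US"),
      user_id   = c("AR-100", "AR-100", "US-100", "US-100"),
      q_order   = c(1, 2, 1, 2),
      questions = c("Question1", "Question2", "Question1", "Question2"),
      answers   = c("yes", "0", "no", "0"),
      stringsAsFactors = FALSE
    )

    wide <- survey %>%
      select(-q_order) %>%                    # q_order is unique per row and would block the collapse
      pivot_wider(names_from  = questions,    # one new column per question
                  values_from = answers,      # filled with the answer text
                  values_fn   = toString)     # join duplicates instead of failing or counting
    ```

    With q_order removed, the remaining columns identify a respondent, so `wide` has one row per respondent and the answers stay as character values.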