Search code examples
rdata.tablesapplyrbindlist

Import text file separated by new line


Ultimately, I need to create the following table:

ID Age Gender
1  25  F
2  14  M

and so on. . .

I got my data in many txt files, where the ID is the name of the file. Age and Gender are separated by new line, for example:

Age: 25
Gender: F

Now, I merge all the files into a one data table with the file path as the ID column (later I'll get rid of the whole path with stringr)

all_test <- list.files(path = "my/file/path", full.names=T)
dtt <- rbindlist( sapply(all_test, fread, simplify = FALSE, sep=":", sep2=" "),use.names = TRUE, idcol = "ID" )

And here is my problem - I'm getting the following table:

ID V1     V2
1  Age    25
2  Gender F

I tried using sep = "\n" but it did nothing. How can I tell R each variable is in a different line? (preferably with data.table)


Solution

  • A possible approach is to reshape into wide format with dcast and use type.convert to get the columns into the right classes:

    DTnew <- dcast(DT, ID ~ V1, value.var = 'V2')[, (2:3) := lapply(.SD, type.convert), .SDcols = 2:3][]
    

    which gives:

    > DTnew
       ID Age Gender
    1:  1  25      F
    2:  2  14      M
    
    > str(DTnew)
    Classes ‘data.table’ and 'data.frame':    2 obs. of  3 variables:
     $ ID    : int  1 2
     $ Age   : int  25 14
     $ Gender: Factor w/ 2 levels "F","M": 1 2
     - attr(*, ".internal.selfref")=<externalptr> 
     - attr(*, "sorted")= chr "ID"
    

    Example data:

    DT1 <- fread('Age: 25
                  Gender: F', sep = ':')
    DT2 <- fread('Age: 14
                  Gender: M', sep = ':')
    DT <- rbindlist(list(DT1, DT2), idcol = 'ID')