Search code examples
rcsvrbinddo.call

In R when creating names for columns from delimited text in another column, new columns names are only assigned from the 1st row


Using R to create names for columns from delimited text in another column, the names for the new columns are only being taken from the first row, the rest are labelled NA. even though they have the correct value assigned.

The data consists of valid semicolon delimited values as well as empty and NULL values in one column. I am trying to create new columns named from each valid delimited value and assign a value of 1 to the new column for the row which that column name is found.

e.g.

  A  B                                       C E Domestic Glue_Sniffing NA NA NA NA NA NA NA
1 1  0 ;Domestic;;Domestic abuse;Glue Sniffing 7        1            NA NA NA NA NA NA NA NA
2 2  4                             ;Drug;Abus; 8       NA            NA  1  1 NA NA NA NA NA
3 3  6          ;Drug;Domestic Abuse;Domestic; 9        1            NA  1 NA NA NA NA NA NA
4 4  5                       ;Alcohol;;Verbal; 5       NA            NA NA NA  1  1 NA NA NA
5 5  7                      ;Shinpads;Abus ; ; 6       NA            NA NA  1 NA NA  1 NA NA
6 6  9                                         7       NA            NA NA NA NA NA NA NA NA
7 7 12                                   ;Ail; 8       NA            NA NA NA NA NA NA NA  1
8 8 10                                         7       NA            NA NA NA NA NA NA NA NA
9 9  9                                       ; 8       NA            NA NA NA NA NA NA NA NA

The problem is that it only takes the names from the first row queried. I used the following answer as a template

Where am I going wrong? The code is as per the template but modified to add "=1" to each delimited element as follows:

#Define a function to take vectors like c("A=1","B=2") and changed them into named vectors like c(A="1", B="2").
createNamedVectors <- function(x) {

    a <- strsplit(x,"=")
    setNames(sapply(a,'[',2), sapply(a,'[',1))

}


tmp.df<-data.frame(
    A = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9), 
    B = c(0L, 4, 6, 5L, 7L, 9L, 12L, 10L, 9), 
    C = c(";Ailment;Drug;Abus;Domestic;Domestic abuse;Glue Sniffing", 
          ";Drug;Abus;", 
          ";Drug;Domestic Abuse;Domestic;",
          ";Alcohol;;Verbal;",
          ";Shinpads; ;",
          "",
          ";Ail;",
          " ",
          ";"),
    D = c(";Vodka=2;Drug;Abus;", 
          ";Drug;Abu;", 
          ";Alcohol;Drug;Verbal;",
          ";Drug;Doms;",
          ";Shinpads;",
          " ",
          "",
          ";Ail;",
          "New"),
    E = c(7L, 8L, 9L, 5L, 6L, 7L, 8L, 7L, 8), 
    stringsAsFactors=T
)

DelimitedNamesOfNewCols <- str_replace_all(as.character(tmp.df$C),"Domestic [Aa]buse","Domestic")
DelimitedNamesOfNewCols <- str_replace_all(DelimitedNamesOfNewCols,";*[[:space:]]*;",";")
DelimitedNamesOfNewCols <- str_replace_all(DelimitedNamesOfNewCols,"^;","")
DelimitedNamesOfNewCols <- str_replace_all(DelimitedNamesOfNewCols,";","=1;")
DelimitedNamesOfNewCols <- str_replace_all(DelimitedNamesOfNewCols,"^[[:space:]]+","DUMMY=;")
DelimitedNamesOfNewCols <- str_replace_all(DelimitedNamesOfNewCols,"[[:space:]]+$","DUMMY=;")
DelimitedNamesOfNewCols <- str_replace_all(DelimitedNamesOfNewCols,"[[:space:]]","_")
DelimitedNamesOfNewCols <- str_replace_all(DelimitedNamesOfNewCols,"^$","DUMMY=;")
DelimitedNamesOfNewCols <- str_replace_all(DelimitedNamesOfNewCols,";$","")



ColsAndValsAsNamedVectors <-lapply(strsplit(DelimitedNamesOfNewCols,";"), createNamedVectors)


#Get list of all column names, then trim and remove NA and blanks
UniqueColumnNames <-unique(unlist(sapply(ColsAndValsAsNamedVectors, names)))
UniqueColumnNames <- stri_trim(UniqueColumnNames)
UniqueColumnNames <- UniqueColumnNames[!is.na(UniqueColumnNames)]
UniqueColumnNames <- UniqueColumnNames[stri_cmp_gt(UniqueColumnNames,"")]


#Extract data from all rows for every column
DataFromRowsForEachColumn <-do.call(rbind, lapply(ColsAndValsAsNamedVectors, '[', UniqueColumnNames))


#Convert everything to numeric
class(DataFromRowsForEachColumn)<-"numeric"

#Rejoin with original data.frame removing column 4
cbind(tmp.df[,-4], DataFromRowsForEachColumn)

Runnning through and checking with a multitude of debug statements shows that the Names and values are assigned correctly for all entries as far as the return from the following statement

lapply(ColsAndValsAsNamedVectors, '[', UniqueColumnNames)

which is found in the statement

DataFromRowsForEachColumn <- do.call(rbind, lapply(ColsAndValsAsNamedVectors, '[', UniqueColumnNames))

However in DataFromRowsForEachColumn only the column names from the first row there.

Do I need to go back to the drawing board, have I made a glaringly obvious error or is there a more elegant solution that I am missing?

All replies gratefully received


Solution

  • There is very possibly a better way to do this, but your code is almost working so let's get it there.

    As you say, everything's fine until we get here

    # problem line
    DataFromRowsForEachColumn <-do.call(rbind, lapply(ColsAndValsAsNamedVectors, '[', UniqueColumnNames))
    

    If we just run the lapply, we can see that it's working right, but only in each list element only the found columns are named. rbind won't add to the names, it will just take the names from the first list element.

    lapply(ColsAndValsAsNamedVectors, '[', UniqueColumnNames)
    # [[1]]
    #       Ailment          Drug          Abus      Domestic Glue_Sniffing          <NA>          <NA>          <NA>          <NA>          <NA> 
    #           "1"           "1"           "1"           "1"            NA            NA            NA            NA            NA            NA 
    # 
    # [[2]]
    # <NA> Drug Abus <NA> <NA> <NA> <NA> <NA> <NA> <NA> 
    #   NA  "1"  "1"   NA   NA   NA   NA   NA   NA   NA 
    # 
    # [[3]]
    #     <NA>     Drug     <NA> Domestic     <NA>     <NA>     <NA>     <NA>     <NA>     <NA> 
    #       NA      "1"       NA      "1"       NA       NA       NA       NA       NA       NA 
    # 
    # [[4]]
    #    <NA>    <NA>    <NA>    <NA>    <NA> Alcohol  Verbal    <NA>    <NA>    <NA> 
    #      NA      NA      NA      NA      NA     "1"     "1"      NA      NA      NA 
    # 
    # [[5]]
    #     <NA>     <NA>     <NA>     <NA>     <NA>     <NA>     <NA> Shinpads     <NA>     <NA> 
    #       NA       NA       NA       NA       NA       NA       NA      "1"       NA       NA 
    

    The easy fix is to just fix the names in the result:

    colnames(DataFromRowsForEachColumn) = UniqueColumnNames
    

    Add that line, and I think you'll be happy with the final result.


    Other comments:

    I'm very skeptical of your stri_trim line - it seems too late in your pipeline. I think you need to trim earlier (if it needs it). It does seem like you take care of it with your "^[[:space:]]+" and "[[:space:]]+$" replacement lines. You could replace those 2 with a str_trim. But if you still have spaces when you get to the unique column names, make the fix upstream.