Search code examples
rdataframevectorizationstringr

How to quickly iterate over rows in a very large dataframe to apply str_split and extract second last element to replace value in dataframe


I have a very large dataframe (750,000+ rows) and I need to change the data in there from:

                    V        W      X       Y                                                          Z
1   1085_S4_kaiju.out 0.697392 418154   28901                                          Bacteria;Pseudomonadota;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Salmonella;Salmonella enterica;

2   1085_S4_kaiju.out 0.692621 415293     562                                           Bacteria;Pseudomonadota;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli;

3   1085_S4_kaiju.out 0.690187 413834     470                                    Bacteria;Pseudomonadota;Gammaproteobacteria;Moraxellales;Moraxellaceae;Acinetobacter;Acinetobacter baumannii;

4   1085_S4_kaiju.out 0.625289 374921     573                                      Bacteria;Pseudomonadota;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Klebsiella;Klebsiella pneumoniae;

5   1085_S4_kaiju.out 0.285564 171223   33069                                     Bacteria;Pseudomonadota;Gammaproteobacteria;Pseudomonadales;Pseudomonadaceae;Pseudomonas;Pseudomonas viridiflava;
...

to:

                    V        W      X       Y                                                          Z
1   1085_S4_kaiju.out 0.697392 418154   28901                                        Salmonella enterica
2   1085_S4_kaiju.out 0.692621 415293     562                                           Escherichia coli
3   1085_S4_kaiju.out 0.690187 413834     470                                    Acinetobacter baumannii
4   1085_S4_kaiju.out 0.625289 374921     573                                      Klebsiella pneumoniae
5   1085_S4_kaiju.out 0.285564 171223   33069                                    Pseudomonas viridiflava
...

Essentially taking the last column, splitting string based on ";", and returning second last value. Since some of the rows have values in column Z which are not of the same format as the rest, I want to remove those rows from the dataframe.

The following does the job, but takes FOREVER

library(stringr)

row_to_drop = c()
for(i in 1:nrow(tb)) {
  if (length(str_split(tb[i, 5], ";")[[1]]) < 8) {
    row_to_drop = c(row_to_drop, i)
  } else {
      tb[i, 5] = str_split(tb[i, 5], ";")[[1]][[7]]
  }
}

I have tried to look into vectorization but Im finding applying it to what Im trying to do here very confusing. Can I get some help on how to perform the changes to the dataframe I want much faster?

thanks


Solution

  • Using stringr::str_extract():

    library(stringr)
    
    tb$Z <- str_extract(tb$Z, "[^;]*(?=;$)")
    

    Result:

                      V        W      X     Y                       Z
    1 1085_S4_kaiju.out 0.697392 418154 28901     Salmonella enterica
    2 1085_S4_kaiju.out 0.692621 415293   562        Escherichia coli
    3 1085_S4_kaiju.out 0.690187 413834   470 Acinetobacter baumannii
    4 1085_S4_kaiju.out 0.625289 374921   573   Klebsiella pneumoniae
    5 1085_S4_kaiju.out 0.285564 171223 33069 Pseudomonas viridiflava
    

    "[^;]*" matches sequential characters other than ";". "(?=;$)" matches characters immediately preceding the final ";". When combined, they match all text between the last two ";"s.