I have a very large dataframe (750,000+ rows) and I need to change the data in there from:
V W X Y Z
1 1085_S4_kaiju.out 0.697392 418154 28901 Bacteria;Pseudomonadota;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Salmonella;Salmonella enterica;
2 1085_S4_kaiju.out 0.692621 415293 562 Bacteria;Pseudomonadota;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli;
3 1085_S4_kaiju.out 0.690187 413834 470 Bacteria;Pseudomonadota;Gammaproteobacteria;Moraxellales;Moraxellaceae;Acinetobacter;Acinetobacter baumannii;
4 1085_S4_kaiju.out 0.625289 374921 573 Bacteria;Pseudomonadota;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Klebsiella;Klebsiella pneumoniae;
5 1085_S4_kaiju.out 0.285564 171223 33069 Bacteria;Pseudomonadota;Gammaproteobacteria;Pseudomonadales;Pseudomonadaceae;Pseudomonas;Pseudomonas viridiflava;
...
to:
V W X Y Z
1 1085_S4_kaiju.out 0.697392 418154 28901 Salmonella enterica
2 1085_S4_kaiju.out 0.692621 415293 562 Escherichia coli
3 1085_S4_kaiju.out 0.690187 413834 470 Acinetobacter baumannii
4 1085_S4_kaiju.out 0.625289 374921 573 Klebsiella pneumoniae
5 1085_S4_kaiju.out 0.285564 171223 33069 Pseudomonas viridiflava
...
Essentially taking the last column, splitting string based on ";", and returning second last value. Since some of the rows have values in column Z which are not of the same format as the rest, I want to remove those rows from the dataframe.
The following does the job, but takes FOREVER
library(stringr)
row_to_drop = c()
for(i in 1:nrow(tb)) {
if (length(str_split(tb[i, 5], ";")[[1]]) < 8) {
row_to_drop = c(row_to_drop, i)
} else {
tb[i, 5] = str_split(tb[i, 5], ";")[[1]][[7]]
}
}
I have tried to look into vectorization but Im finding applying it to what Im trying to do here very confusing. Can I get some help on how to perform the changes to the dataframe I want much faster?
thanks
Using stringr::str_extract()
:
library(stringr)
tb$Z <- str_extract(tb$Z, "[^;]*(?=;$)")
Result:
V W X Y Z
1 1085_S4_kaiju.out 0.697392 418154 28901 Salmonella enterica
2 1085_S4_kaiju.out 0.692621 415293 562 Escherichia coli
3 1085_S4_kaiju.out 0.690187 413834 470 Acinetobacter baumannii
4 1085_S4_kaiju.out 0.625289 374921 573 Klebsiella pneumoniae
5 1085_S4_kaiju.out 0.285564 171223 33069 Pseudomonas viridiflava
"[^;]*"
matches sequential characters other than ";"
. "(?=;$)"
matches characters immediately preceding the final ";"
. When combined, they match all text between the last two ";"
s.