I have a data frame sp that contains several species names, but because they come from different databases, they are written in different ways.
For example, one species can appear as both "Urtica dioica" and "Urtica dioica L.".
To correct this, I use the following code, which extracts only the first two words from a row:
paste(strsplit(sp[i,"sp"]," ")[[1]][1],strsplit(sp[i,"sp"]," ")[[1]][2],sep=" ")
For now, this code sits inside a for loop, which works but takes ages to finish:
for (i in seq_along(sp$sp)) {
  sp[i, "sp2"] = paste(strsplit(sp[i, "sp"], " ")[[1]][1],
                       strsplit(sp[i, "sp"], " ")[[1]][2],
                       sep = " ")
}
Is there a way to improve this basic code using vectorization or an apply function?
You could just use vectorized regular expression functions:
library(stringr)
x <- c("Urtica dioica", "Urtica dioica L.")
> str_extract(string = x,"\\w+ \\w+")
[1] "Urtica dioica" "Urtica dioica"
I happen to find stringr convenient here, but with the right regex for your specific data you could do this just as well with base functions like gsub.
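For the base-R route, here is a minimal sketch that works on the whole column at once, so no loop is needed. The sample rows are made up for illustration (only the sp and sp2 column names come from the question), and I've used sub rather than gsub because only one replacement per string is needed. Treat the pattern as a starting point that may need adjusting for your data:

```r
# Hypothetical sample data standing in for the `sp` data frame in the question
sp <- data.frame(
  sp = c("Urtica dioica", "Urtica dioica L.", "Quercus robur subsp. robur"),
  stringsAsFactors = FALSE
)

# Capture the first two whitespace-separated words and drop whatever follows.
# sub() is vectorized, so the whole column is processed in a single call.
sp$sp2 <- sub("^(\\S+\\s+\\S+).*$", "\\1", sp$sp)

print(sp)
```

Two differences from the other approaches are worth knowing. Strings with fewer than two words do not match the pattern, so sub leaves them unchanged, whereas the original paste/strsplit loop would append "NA". And because \\S matches any non-whitespace character, a hyphenated epithet stays whole, while \\w+ would stop at the hyphen.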