Removing first word from data frame cell when it starts with lowercase letter in R


I want to clean up a taxonomy table of bacterial species in R by deleting the values in all cells that start with a lowercase letter.

I have this column in my taxonomy data frame:

Species
Tuwongella immobilis
Woesebacteria
unidentified marine
bacterium Ellin506

And I want:

Species
Tuwongella immobilis
Woesebacteria

I tried:

unwanted <- "^[:upper:]+[:lower:]+"
tax.clean$Species <- str_replace_all(tax.clean$Species, unwanted, "")

However, this doesn't seem to work: the pattern does not match the intended species.


Solution

  • If you are working with a data frame, I suggest using dplyr::filter to drop the unwanted rows.

    grepl() returns a logical vector, so !grepl("^[[:lower:]]", Species) is TRUE for every value that does not start with a lowercase letter (^ anchors the match to the beginning of the string).

    library(dplyr)
    
    df %>% filter(!grepl("^[[:lower:]]", Species))
    
                   Species
    1 Tuwongella immobilis
    2        Woesebacteria
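
    For completeness, here is a minimal base R sketch of the same idea that does not need dplyr. The data frame is rebuilt by hand from the values in the question, so its contents are an assumption about the real table. The names starts_lower, kept and blanked are made up for illustration. Option 2 is only a guess at what "delete values" could mean if the rows themselves should stay.

```r
# Hypothetical data frame reproducing the Species column from the question
tax.clean <- data.frame(
  Species = c("Tuwongella immobilis", "Woesebacteria",
              "unidentified marine", "bacterium Ellin506"),
  stringsAsFactors = FALSE
)

# TRUE where the value begins with a lowercase letter
starts_lower <- grepl("^[[:lower:]]", tax.clean$Species)

# Option 1: drop those rows entirely (same idea as the dplyr filter)
kept <- tax.clean[!starts_lower, , drop = FALSE]

# Option 2: keep every row but set the offending cells to NA
blanked <- tax.clean
blanked$Species[starts_lower] <- NA_character_

print(kept)
print(blanked)
```

    Option 1 keeps only the two capitalized species. Option 2 preserves the row count, which may matter if other columns in the table still carry useful information.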