Tags: r, string, data-cleaning, stringr

Keep the first 4 words in a column


I'm trying to keep only the first 4 words of a column in my data, while still keeping the observations that have fewer than 4 words.

This is a sample of what some of the data looks like.

State  Company                                        Number of workers
X      FAIRFIELD NURSING AND REHABILITATION CENTER,   99
Y      ATHENAHEALTH                                   24
Z      DRS TEST & ENERGY MANAGEMENT,                  1009
W      AMERICAN APPAREL                               376
C      BERRY PLASTICSPANY -ALENCE SPECIALTY ADHES     67
A      TUSCALOOSA RESOURCES , SWANN'S CROSSING MINE   456

I've used the following code:

library(stringr)

df$Company1 <- word(df$Company, 1, 4)

This does give me a column of 4-word company names, but it doesn't work for me: for companies with fewer than 4 words, word() returns NA instead of the name, so those companies are lost.

So I'm hoping to find a solution that keeps every observation that has 1 to 4 words.


Solution

  • You can do it as follows.

    1. Split Company into words using str_split() from stringr.
    2. Paste the first four words of each row back together with apply().
    3. Trim the trailing whitespace left on rows that have fewer than four words.
    library(stringr)
    
    df <- data.frame(
      State = c("X", "Y", "Z", "W", "C", "A"),
      Company = c(
        "FAIRFIELD NURSING AND REHABILITATION CENTER",
        "ATHENAHEALTH",
        "DRS TEST & ENERGY MANAGEMENT",
        "AMERICAN APPAREL",
        "BERRY PLASTICSPANY -ALENCE SPECIALTY ADHES",
        "TUSCALOOSA RESOURCES , SWANN'S CROSSING MINE"
      ),
      number_of_workers = c(99, 24, 1009, 376, 67, 456)
    )
    
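    # Split on single spaces into a character matrix (short rows are padded with ""),
    # keep the first four columns, then re-join each row and trim the padding.
    # Note: [, 1:4] assumes at least one company name has four or more words.
    df$Company1 <- str_split(df$Company, " ", simplify = TRUE)[, 1:4] |>
      apply(1, paste, collapse = " ") |>
      trimws(which = "right")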
    

    Output (df$Company1):

    [1] "FAIRFIELD NURSING AND REHABILITATION"
    [2] "ATHENAHEALTH"                        
    [3] "DRS TEST & ENERGY"                   
    [4] "AMERICAN APPAREL"                    
    [5] "BERRY PLASTICSPANY -ALENCE SPECIALTY"
    [6] "TUSCALOOSA RESOURCES , SWANN'S"
    

    Created on 2023-04-28 with reprex v2.0.2
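
If you'd rather not rely on the split matrix having at least four columns, the same idea can be sketched in base R. This is only a minimal sketch: the helper name first_n_words() is hypothetical (it is not part of stringr or base R), and it assumes words are separated by single spaces, as in the sample data.

```r
# Hypothetical helper: keep at most the first `n` space-separated words
# of each element of `x`. Assumes words are separated by single spaces.
first_n_words <- function(x, n = 4) {
  x <- as.character(x)                      # also accepts factor columns
  words <- strsplit(x, " ", fixed = TRUE)   # list with one character vector per element
  out <- vapply(
    words,
    function(w) paste(head(w, n), collapse = " "),  # head() returns fewer words if fewer exist
    character(1),
    USE.NAMES = FALSE
  )
  out[is.na(x)] <- NA_character_            # keep missing values missing, not the string "NA"
  out
}

company <- c("FAIRFIELD NURSING AND REHABILITATION CENTER",
             "ATHENAHEALTH",
             "AMERICAN APPAREL")
first_n_words(company)
```

Because head() simply returns fewer elements when a name is short, there is no padding to trim afterwards, and the call would be df$Company1 <- first_n_words(df$Company) on the data frame above.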