I'm trying to keep only the first four words of a column in my data, while still keeping the observations that have fewer than four words.
This is a sample of what some of the data looks like.
State | Company | Number of workers |
---|---|---|
X | FAIRFIELD NURSING AND REHABILITATION CENTER, | 99 |
Y | ATHENAHEALTH | 24 |
Z | DRS TEST & ENERGY MANAGEMENT, | 1009 |
W | AMERICAN APPAREL | 376 |
C | BERRY PLASTICSPANY -ALENCE SPECIALTY ADHES | 67 |
A | TUSCALOOSA RESOURCES , SWANN'S CROSSING MINE | 456 |
I've used the following code:
library(stringr)
df$Company1 <- word(df$Company, 1, 4)
This does produce a column of four-word company names, but it doesn't work for me because companies with fewer than four words are lost: word() returns NA for those instead.
So I'm hoping to find a solution that keeps every observation, including those with fewer than four words.
You can do this by splitting Company with str_split() from stringr, keeping the first four columns of the result, and pasting each row back together with apply():
library(stringr)
df <- data.frame(
State = c("X","Y","Z","W","C","A"),
Company = c("FAIRFIELD NURSING AND REHABILITATION CENTER",
"ATHENAHEALTH",
"DRS TEST & ENERGY MANAGEMENT",
"AMERICAN APPAREL",
"BERRY PLASTICSPANY -ALENCE SPECIALTY ADHES",
"TUSCALOOSA RESOURCES , SWANN'S CROSSING MINE"),
number_of_workers = c(99,24,1009,376,67, 456))
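# Split on spaces into a character matrix (shorter names are padded with ""),
# keep the first four columns, and paste each row back together.
df$Company1 <- str_split(df$Company, " ", simplify = TRUE)[, 1:4] |>
  apply(1, paste, collapse = " ") |>
  trimws(which = "right")  # drop the trailing spaces left by the "" padding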
Output:
[1] "FAIRFIELD NURSING AND REHABILITATION"
[2] "ATHENAHEALTH"
[3] "DRS TEST & ENERGY"
[4] "AMERICAN APPAREL"
[5] "BERRY PLASTICSPANY -ALENCE SPECIALTY"
[6] "TUSCALOOSA RESOURCES , SWANN'S"
Created on 2023-04-28 with reprex v2.0.2
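One caveat with the approach above: the [, 1:4] indexing needs the split matrix to have at least four columns, so it would fail with a "subscript out of bounds" error if every name in the column had three words or fewer. If that could happen in your data, a base R variant along these lines might be safer. This is only a sketch: first_words() is a hypothetical helper name (not a stringr function), and it assumes words are separated by single spaces, as in the sample data.

```r
# Hypothetical helper: keep at most `n` space-separated words per string.
first_words <- function(x, n = 4) {
  x <- as.character(x)                       # also accepts factor columns
  pieces <- strsplit(x, " ", fixed = TRUE)   # one character vector per string
  out <- vapply(
    pieces,
    function(p) paste(head(p, n), collapse = " "),  # head() returns fewer than n if that's all there is
    character(1)
  )
  out[is.na(x)] <- NA_character_             # keep missing values missing
  out
}

company <- c("FAIRFIELD NURSING AND REHABILITATION CENTER",
             "ATHENAHEALTH",
             "AMERICAN APPAREL")
first_words(company)
#> [1] "FAIRFIELD NURSING AND REHABILITATION" "ATHENAHEALTH"
#> [3] "AMERICAN APPAREL"
```

With the data frame from the answer, this would be used as df$Company1 <- first_words(df$Company). Because head() simply returns whatever pieces exist, no padding or trimws() step is needed.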