I have a data frame made of texts taken from Wikipedia. An example would be:
dput(text3)
structure(list(texts = c("Apollo 13 was the seventh crewed mission in the Apollo space program and the third meant to land on the Moon. The craft was launched from Kennedy Space Center on April 11, 1970, but the lunar landing was aborted after an oxygen tank in the service module (SM) failed two days into the mission. The crew instead looped around the Moon, and returned safely to Earth on April 17. The mission was commanded by Lovell with Swigert as command module (CM) pilot and Haise as lunar module (LM) pilot. Swigert was a late replacement for Mattingly, who was grounded after exposure to rubella.",
"A routine stir of an oxygen tank ignited damaged wire insulation inside it, causing an explosion that vented the contents of both of the SM's oxygen tanks to space. Without oxygen, needed for breathing and for generating electric power, the SM's propulsion and life support systems could not operate. The CM's systems had to be shut down to conserve its remaining resources for reentry, forcing the crew to transfer to the LM as a lifeboat. With the lunar landing canceled, mission controllers worked to bring the crew home alive. ",
"Although the LM was designed to support two men on the lunar surface for two days, Mission Control in Houston improvised new procedures so it could support three men for four days. The crew experienced great hardship caused by limited power, a chilly and wet cabin and a shortage of potable water. There was a critical need to adapt the CM's cartridges for the carbon dioxide removal system to work in the LM; the crew and mission controllers were successful in improvising a solution. The astronauts' peril briefly renewed interest in the Apollo program; tens of millions watched the splashdown in the South Pacific Ocean on television."
), paragraph = c("p1", "p2", "p3"), source = c("wiki", "wiki",
"wiki"), astronauts = c("Lovell", "Swigert", "Haise")), row.names = c(NA,
-3L), class = "data.frame")
In my research I need to study the people in the articles by their social role; their actual names do not interest me. So I need to replace each name with a unique social indicator:
Lovell = @Astronaut1
Swigert = @Astronaut2
Haise = @Astronaut3
Mattingly = @Astronaut4
a01 <- c('Lovell', 'Swigert', 'Haise', 'Mattingly')
a02 <- c('@Astronaut1', '@Astronaut2', '@Astronaut3', '@Astronaut4')
Since I have to substitute the strings in both columns while keeping the data frame format, I tried the following, which failed:
library(stringi)
text3$texts <- stri_replace_all_fixed(str = text3$texts, pattern = a01, replacement = a02)
Error in `$<-.data.frame`(`*tmp*`, texts, value = c("Apollo 13 was the seventh crewed mission in the Apollo space program and the third meant to land on the Moon. The craft was launched from Kennedy Space Center on April 11, 1970, but the lunar landing was aborted after an oxygen tank in the service module (SM) failed two days into the mission. The crew instead looped around the Moon, and returned safely to Earth on April 17. The mission was commanded by @Astronaut1 with Swigert as command module (CM) pilot and Haise as lunar module (LM) pilot. Swigert was a late replacement for Mattingly, who was grounded after exposure to rubella.", :
replacement has 4 rows, data has 3
In addition: Warning message:
In stri_replace_all_fixed(str = text3$texts, pattern = a01, replacement = a02) :
longer object length is not a multiple of shorter object length
and
text3$astronauts <- stri_replace_all_fixed(str = text3$astronauts, pattern = a01, replacement = a02)
Error in `$<-.data.frame`(`*tmp*`, astronauts, value = c("@Astronaut1", :
replacement has 4 rows, data has 3
In addition: Warning message:
In stri_replace_all_fixed(str = text3$astronauts, pattern = a01, :
longer object length is not a multiple of shorter object length
Any help would be lovely.
The error received with:
stri_replace_all_fixed(str = text3$texts, pattern = a01, replacement = a02)
comes from stringi's vectorized behavior (the default). See ?stringi-arguments:
Almost all functions are vectorized with respect to all their arguments and the recycling rule is applied whenever necessary. Due to this property you may, for instance, search for one pattern in each given string, search for each pattern in one given string, and search for the i-th pattern within the i-th string. This behavior sometimes leads to peculiar results - we assume you know what you are doing.
Because of this, the result will be 4 strings:
[1] first row in texts, substituted with the first pattern/replacement (Astronaut1)
[2] second row in texts, substituted with the second pattern/replacement (Astronaut2)
[3] third row in texts, substituted with the third pattern/replacement (Astronaut3)
[4] first row in texts (recycled), substituted with the fourth pattern/replacement (Astronaut4)
Since the returned object has length 4, which is more than the 3 rows in text3$texts that you started with, assigning it back to the column causes the error.
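To see the recycling in isolation, here is a minimal sketch on a toy vector. The short strings in x are made up for illustration and are not your Wikipedia texts; the pattern and replacement vectors mirror a01 and a02.

```r
library(stringi)

# Toy input: 3 strings, but 4 pattern/replacement pairs
x   <- c("Lovell flew", "Swigert flew", "Haise flew")
pat <- c("Lovell", "Swigert", "Haise", "Mattingly")
rep <- c("@Astronaut1", "@Astronaut2", "@Astronaut3", "@Astronaut4")

# Default (vectorize_all = TRUE): the i-th pattern is applied to the i-th string,
# and x is recycled to length 4. The recycling warning is suppressed here.
res <- suppressWarnings(stri_replace_all_fixed(x, pat, rep))

length(res)  # 4, one more than length(x)
res[4]       # x[1] again, searched only for "Mattingly", so it comes back unchanged
```

The fourth element is simply the first string recycled, which is why the result no longer fits into a 3-row column.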
To get around this, set vectorize_all = FALSE:
stri_replace_all_fixed(str = text3$texts, pattern = a01, replacement = a02, vectorize_all = FALSE)
This should return 3 strings, with every pattern replaced by its corresponding replacement in each of them.
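Since you want to update both columns and keep the data frame, one possible way is to apply the same call to each column with lapply. This is only a sketch on a small made-up data frame (df and cols are illustrative names, and the column names are assumed to be texts and astronauts); adapt it to text3 as needed.

```r
library(stringi)

a01 <- c("Lovell", "Swigert", "Haise", "Mattingly")
a02 <- c("@Astronaut1", "@Astronaut2", "@Astronaut3", "@Astronaut4")

# Toy stand-in for text3 (hypothetical, shortened texts)
df <- data.frame(
  texts = c("Lovell commanded; Swigert replaced Mattingly.",
            "Haise piloted the LM.",
            "No names here."),
  astronauts = c("Lovell", "Swigert", "Haise"),
  stringsAsFactors = FALSE
)

# Apply every pattern/replacement pair to every string in both columns
cols <- c("texts", "astronauts")
df[cols] <- lapply(df[cols], stri_replace_all_fixed,
                   pattern = a01, replacement = a02,
                   vectorize_all = FALSE)

df$astronauts  # "@Astronaut1" "@Astronaut2" "@Astronaut3"
```

Note that fixed matching replaces substrings, so a name that happens to occur inside a longer word would be replaced too. If that matters for your texts, a regex with word boundaries (stri_replace_all_regex) may be a safer choice.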