Assume a character vector of company names where the names come in various forms. Here is a small version of a 10,000-row data frame; it shows the desired second vector ("two.names").
df <- structure(list(firm = structure(1:8, .Label = c("Carlson Caspers",
"Carlson Caspers Lindquist & Schuman P.A", "Carlson Caspers Vandenburgh Lindquist & Schuman P.A.",
"Carlson Caspers Vandenburgh & Lindquist", "Carmody Torrance",
"Carmody Torrance et al", "Carmody Torrance Sandak", "Carmody Torrance Sandak & Hennessey LLP"
), class = "factor"), two.names = structure(c(1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L), .Label = c("Carlson Caspers", "Carmody Torrance"
), class = "factor")), .Names = c("firm", "two.names"), row.names = c(NA,
-8L), class = "data.frame")
  firm                                                  two.names
1 Carlson Caspers                                       Carlson Caspers
2 Carlson Caspers Lindquist & Schuman P.A               Carlson Caspers
3 Carlson Caspers Vandenburgh Lindquist & Schuman P.A.  Carlson Caspers
4 Carlson Caspers Vandenburgh & Lindquist               Carlson Caspers
5 Carmody Torrance                                      Carmody Torrance
6 Carmody Torrance et al                                Carmody Torrance
7 Carmody Torrance Sandak                               Carmody Torrance
8 Carmody Torrance Sandak & Hennessey LLP               Carmody Torrance
Assume the vector has been sorted alphabetically by firm name (which I believe puts the shortest version first). How can I use agrep() to start with the first company name, match it to the second and, assuming a close match, add the first company name to the new column (two.names) for both of them? Then match it to the third element, and so on. All the Carlson variations would be matched.
If there is not a sufficient match, as when R encounters the first Carmody, start over with it and match to the next element, and so on until the next non-match.
If there is no match between consecutive companies, R should proceed until it finds a match.
The answer to this question uses fuzzy matching on the entire vector and groups by years: "Create a unique ID by fuzzy matching of names (via agrep using R)". It seems, however, to offer part of the code that would solve my problem. This other question uses stringdist(): "stringdist".
Below, the object matches is a list that shows matches, but I don't know the code to tell R to "take the first one and convert the following matches, if any, to that name and put that name in the new variable column."
matches <- lapply(levels(df$firm), agrep, x=levels(df$firm), fixed=TRUE, value=FALSE)
I went and wrote it out in a for-loop, first defining the first line as a, then finding the matches, updating the data frame and picking the next name to look for. That's what I meant by "do not try to solve this with a one-liner": you have to make it work first in a much more verbose way, so you can understand what's going on. Then, and ONLY if you NEED to, you can try to compress it into a one-liner.
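firm.txt <- as.character(df$firm)
df$two.names <- NA_character_  # create the new column before filling it
a <- firm.txt[1]
for (i in seq_along(firm.txt)) {
  # i don't know how to write it any prettier
  match <- agrep(a, firm.txt)
  if (length(match) > 0) {
    df$two.names[match] <- a
    i <- max(match) + 1
    if (i > length(firm.txt)) break  # ran past the last row
    a <- firm.txt[i]
  }
}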
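If you want the matching to go strictly from one row to the next, as the question describes, here is a minimal sketch of that variant. It is only an illustration under some assumptions: the helper name group_by_anchor() is made up for this example, the names are assumed to be sorted, non-empty and not NA, and the max.distance of 0.1 is simply agrep's default rather than a tuned value.

```r
# Hypothetical helper (not from the original post): walk down a sorted
# vector and label each name with the current "anchor" name.
group_by_anchor <- function(x, max.distance = 0.1) {
  x <- as.character(x)
  out <- character(length(x))
  if (length(x) == 0L) return(out)
  anchor <- x[1]
  for (i in seq_along(x)) {
    # agrepl() is TRUE when x[i] approximately contains the anchor
    if (!agrepl(anchor, x[i], max.distance = max.distance, fixed = TRUE)) {
      anchor <- x[i]  # no close match: start over with this name
    }
    out[i] <- anchor
  }
  out
}

firms <- c("Carlson Caspers",
           "Carlson Caspers Lindquist & Schuman P.A",
           "Carlson Caspers Vandenburgh Lindquist & Schuman P.A.",
           "Carlson Caspers Vandenburgh & Lindquist",
           "Carmody Torrance",
           "Carmody Torrance et al",
           "Carmody Torrance Sandak",
           "Carmody Torrance Sandak & Hennessey LLP")

two.names <- group_by_anchor(firms)
```

Because each name is only compared with the current anchor, a later firm that happens to resemble an earlier group cannot overwrite it. The trade-off is that the result depends on the sort order putting the shortest version of each firm first.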