I'm trying to do what should be a simple lookup/update between data frames in R with no success. It's based on some Flickr data, where some cameras have a number of different names - and I want to convert each of these to a standard name. I've looked at similar answers here - but they don't seem to deal with the situation where I'm matching a substring, which could appear anywhere in the column.
I've put a simplified set of data here which illustrates my problem :
LookupDF <- data.frame(Testr=c("EOS DIGITAL REBEL XTI (EOS 400D, EOS KISS X)", "PowerShot S400 (Digital IXUS 400, IXY Digital 400)", "PowerShot A530", "PowerShot A2300", "PowerShot A720 IS", "PowerShot SD880 IS (Digital IXUS 870 IS, IXY Digital 920 IS, IXY 999)"))
A little tidyup + creating a new column (I decided to use the 1st camera name as the "Standard Name") :
LookupDF$StandardName <- sapply(strsplit(as.character(LookupDF$Testr),'\\('), "[", 1)
LookupDF$StandardName <- gsub("[[:space:]]*$","",LookupDF$StandardName)
LookupDF
Testr StandardName
1 EOS DIGITAL REBEL XTI (EOS 400D, EOS KISS X) EOS DIGITAL REBEL XTI
2 PowerShot S400 (Digital IXUS 400, IXY Digital 400) PowerShot S400
3 PowerShot A530 PowerShot A530
4 PowerShot A2300 PowerShot A2300
5 PowerShot A720 IS PowerShot A720 IS
6 PowerShot SD880 IS (Digital IXUS 870 IS, IXY Digital 920 IS, IXY 999) PowerShot SD880 IS
The data frame I'm trying to update (I know I can use NA in some way instead of Unknown - but it's included for clarity)
InputDF <- data.frame(Capture_Device = c("EOS DIGITAL REBEL XTI", "EOS 400D", "IXY Digital 920 IS", "PowerShot A530"), Standard = rep("Unknown", 4), stringsAsFactors=FALSE)
InputDF
Capture_Device Standard
1 EOS DIGITAL REBEL XTI Unknown
2 EOS 400D Unknown
3 IXY Digital 920 IS Unknown
4 PowerShot A530 Unknown
So, if the InputDF$Capture_Device appears anywhere in LookupDF$Testr, I want to update InputDF$Standard with the corresponding LookupDF$StandardName
Therefore, the required result should look like :
> InputDF
Capture_Device Standard
1 EOS DIGITAL REBEL XTI EOS DIGITAL REBEL XTI
2 EOS 400D EOS DIGITAL REBEL XTI
3 IXY Digital 920 IS PowerShot SD880 IS
4 PowerShot A530 PowerShot A530
I've tried things along these lines :
InputDF$Standard <- LookupDF[pmatch(InputDF$Capture_Device, LookupDF$Testr, duplicates.ok = TRUE),2] # Works for exact match - 1st/4th entries
InputDF$Standard <- LookupDF[charmatch(InputDF$Capture_Device, LookupDF$Testr),2] # Works for exact match at start => 1st/4th entries
InputDF$Standard <- LookupDF[agrep(InputDF$Capture_Device, LookupDF$Testr, max.distance=0.0),2] #error message below
Warning message:
In agrep(InputDF$Capture_Device, LookupDF$Testr, max.distance = 0) :
argument 'pattern' has length > 1 and only the first element will be used
So I only get a correct result where the searched-for string is at the start of the LookupDF$Testr string (or the only value), but not if it's elsewhere within the string. So, I can get InputDF rows 1 and 4 to work, but not 2 and 3.
Any ideas on how I might do this correctly ? Thanks, Frank
You could use
InputDF$Standard <- with(LookupDF, {
sapply(InputDF$Capture_Device, function(x) StandardName[grepl(x, Testr)])
})
which results in the InputDF
changes:
InputDF
# Capture_Device Standard
# 1 EOS DIGITAL REBEL XTI EOS DIGITAL REBEL XTI
# 2 EOS 400D EOS DIGITAL REBEL XTI
# 3 IXY Digital 920 IS PowerShot SD880 IS
# 4 PowerShot A530 PowerShot A530
However, in case there are two or more matches for one device it would be safer to wrap StandardName[grepl(x, Testr)]
in the above call with toString()
in order to make sure that you don't get a list result from sapply()
. This also allows for all matches to be shown in the "Standard" column.