Search code examples
rmatchlookupflickr

Updating a column in data.frame using lookup values from another data.frame - with substring matching


I'm trying to do what should be a simple lookup/update between data frames in R with no success. It's based on some Flickr data, where some cameras have a number of different names - and I want to convert each of these to a standard name. I've looked at similar answers here - but they don't seem to deal with the situation where I'm matching a substring, which could appear anywhere in the column.

I've put a simplified set of data here which illustrates my problem :

LookupDF <- data.frame(Testr=c("EOS DIGITAL REBEL XTI          (EOS 400D, EOS KISS X)",       "PowerShot S400 (Digital IXUS 400, IXY Digital 400)", "PowerShot A530", "PowerShot A2300", "PowerShot A720 IS", "PowerShot SD880 IS (Digital IXUS 870 IS, IXY Digital 920 IS, IXY 999)"))

A little tidyup + creating a new column (I decided to use the 1st camera name as the "Standard Name") :

LookupDF$StandardName <- sapply(strsplit(as.character(LookupDF$Testr),'\\('), "[", 1)
LookupDF$StandardName <- gsub("[[:space:]]*$","",LookupDF$StandardName)
LookupDF

                                                              Testr          StandardName
1                 EOS DIGITAL REBEL XTI          (EOS 400D, EOS KISS X) EOS DIGITAL REBEL XTI
2                    PowerShot S400 (Digital IXUS 400, IXY Digital 400)        PowerShot S400
3                                                        PowerShot A530        PowerShot A530
4                                                       PowerShot A2300       PowerShot A2300
5                                                     PowerShot A720 IS     PowerShot A720 IS
6 PowerShot SD880 IS (Digital IXUS 870 IS, IXY Digital 920 IS, IXY 999)    PowerShot SD880 IS

The data frame I'm trying to update (I know I can use NA in some way instead of Unknown - but it's included for clarity)

    InputDF <- data.frame(Capture_Device = c("EOS DIGITAL REBEL XTI", "EOS 400D", "IXY Digital 920 IS", "PowerShot A530"), Standard = rep("Unknown", 4), stringsAsFactors=FALSE)
    InputDF

         Capture_Device Standard
1 EOS DIGITAL REBEL XTI  Unknown
2              EOS 400D  Unknown
3    IXY Digital 920 IS  Unknown
4        PowerShot A530  Unknown

So, if the InputDF$Capture_Device appears anywhere in LookupDF$Testr, I want to update InputDF$Standard with the corresponding LookupDF$StandardName

Therefore, the required result should look like :

  > InputDF

Capture_Device              Standard
1 EOS DIGITAL REBEL XTI EOS DIGITAL REBEL XTI
2              EOS 400D EOS DIGITAL REBEL XTI
3    IXY Digital 920 IS    PowerShot SD880 IS
4        PowerShot A530        PowerShot A530

I've tried things along these lines :

InputDF$Standard <- LookupDF[pmatch(InputDF$Capture_Device, LookupDF$Testr, duplicates.ok = TRUE),2] # Works for exact match - 1st/4th entries
InputDF$Standard <- LookupDF[charmatch(InputDF$Capture_Device, LookupDF$Testr),2] # Works for exact match at start => 1st/4th  entries

InputDF$Standard <- LookupDF[agrep(InputDF$Capture_Device, LookupDF$Testr, max.distance=0.0),2] #error message below

Warning message:
  In agrep(InputDF$Capture_Device, LookupDF$Testr, max.distance = 0) :
  argument 'pattern' has length > 1 and only the first element will be used

So I only get a correct result where the searched-for string is at the start of the LookupDF$Testr string (or the only value), but not if it's elsewhere within the string. So, I can get InputDF rows 1 and 4 to work, but not 2 and 3.

Any ideas on how I might do this correctly ? Thanks, Frank


Solution

  • You could use

    InputDF$Standard <- with(LookupDF, {
        sapply(InputDF$Capture_Device, function(x) StandardName[grepl(x, Testr)])
    })
    

    which results in the InputDF changes:

    InputDF
    #          Capture_Device              Standard
    # 1 EOS DIGITAL REBEL XTI EOS DIGITAL REBEL XTI
    # 2              EOS 400D EOS DIGITAL REBEL XTI
    # 3    IXY Digital 920 IS    PowerShot SD880 IS
    # 4        PowerShot A530        PowerShot A530
    

    However, in case there are two or more matches for one device it would be safer to wrap StandardName[grepl(x, Testr)] in the above call with toString() in order to make sure that you don't get a list result from sapply(). This also allows for all matches to be shown in the "Standard" column.