filter, grepl, partial-matches

Is there an R function for passing one variable through another variable to find partial matches?


I have a dataset that contains a "platform" variable, which indicates the platform of a social media post, and a "url" variable, which contains the url for that post. I want to write code to validate that each url belongs to the correct platform. I plan on doing this by searching for the platform name within the url, among other checks, as I know this alone does not guarantee an exact match.

Is there an R function for passing one variable through another variable to find partial matches? I have found many examples of how to search for a specific or partial string within a variable, but I want to check, for every record in my dataset, whether the value of one variable appears inside the value of another variable in the same row. Ideally, I would like to do this in one step and then output all the records where there isn't a partial match. Any help is much appreciated!

The first block of code is example data.

The second block of code indicates how I tried and imagined it would work, but have failed to actually make work.

The third block of code shows the very roundabout way I found to validate it: I made a new indicator variable for every category of the platform variable, then confirmed that it was TRUE for each platform and that there were 0 FALSE cases.

platform <- c("Facebook", "Instagram", "Twitter", "YouTube") #Example of the categories that the platform variable contains

url <- c("https://www.facebook.com/AnimalPlanet/photos/a.63789578374/10159416079873375/",
"https://www.facebook.com/AnimalPlanet/photos/a.63789578374/10159416141828375/",
"https://twitter.com/ScienceChannel/status/1564694529168531457?cxt=HHwWgoCzkcvY9LYrAAAA",
"https://www.instagram.com/p/ChnouaXrSDE/", 
"https://www.instagram.com/p/Che3LVcvkpr/", 
"https://www.instagram.com/p/ChLBudQlN3D/") #Example urls
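# Second block: my attempt, which does not work as intended
library(dplyr)
library(stringr)
dataset$platform_r <- str_replace_all(dataset$platform, " ", "") #removes spaces from variable categories

dataset$platform_r <- tolower(dataset$platform_r) #makes all characters lowercase

# grepl() only uses the first element of a vector pattern, so this does not compare row by row
dataset %>% filter(!(grepl(dataset$platform_r, dataset$url)))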
# Third block: roundabout validation with one indicator column per platform
dataset$facebookurl <- grepl("facebook", dataset$url)
dataset$instagramurl <- grepl("instagram", dataset$url)
dataset$twitterurl <- grepl("twitter", dataset$url)
dataset$youtubeurl <- grepl("youtube", dataset$url)
dataset$blogurl <- grepl("blog", dataset$url)

table(dataset$platform)

# rows where the url contains the platform name
dataset %>% filter(platform == "Facebook" & facebookurl)
dataset %>% filter(platform == "Instagram" & instagramurl)
dataset %>% filter(platform == "Twitter" & twitterurl)
dataset %>% filter(platform == "Blog" & blogurl)

# rows where the url does not contain the platform name (should return 0 rows)
dataset %>% filter(platform == "Facebook" & !facebookurl)
dataset %>% filter(platform == "Instagram" & !instagramurl)
dataset %>% filter(platform == "Twitter" & !twitterurl)
dataset %>% filter(platform == "Blog" & !blogurl)

Solution

  • I believe I have mostly solved my own question and greatly reduced the code. Instead of using grepl, I use str_detect to create a new Boolean "url_test" variable. str_detect is vectorized over both the string and the pattern, so each url is compared with the platform name in the same row. I can then filter url_test for the observations that do not contain a matching string, i.e. those that are FALSE.

    library(dplyr)
    library(stringr)

    dataset$platform_r <- str_replace_all(dataset$platform, " ", "") #removes spaces from variable categories
    
    dataset$platform_r <- tolower(dataset$platform_r) #makes all characters lowercase
    
    # TRUE when the url in a row contains that row's platform name
    dataset <- dataset %>% mutate(url_test = str_detect(url, platform_r))
                                                          
    # keep only the rows where the url does not contain the platform name
    dataset %>% filter(!url_test) %>%
      select(platform, url, url_test)
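
For anyone who wants to try this end to end, here is a minimal self-contained sketch. The data frame is an assumption built from the example urls in the question: the pairing of platforms to urls is made up, and the last row is deliberately mislabeled as YouTube so that the check has something to flag. The `url_test_base` column is a base R alternative I added for comparison, not part of the approach above.

```r
library(dplyr)
library(stringr)

# Hypothetical example data: one platform label and one url per post.
# The last row is deliberately mislabeled so the check has a mismatch to find.
dataset <- data.frame(
  platform = c("Facebook", "Facebook", "Twitter",
               "Instagram", "Instagram", "YouTube"),
  url = c(
    "https://www.facebook.com/AnimalPlanet/photos/a.63789578374/10159416079873375/",
    "https://www.facebook.com/AnimalPlanet/photos/a.63789578374/10159416141828375/",
    "https://twitter.com/ScienceChannel/status/1564694529168531457?cxt=HHwWgoCzkcvY9LYrAAAA",
    "https://www.instagram.com/p/ChnouaXrSDE/",
    "https://www.instagram.com/p/Che3LVcvkpr/",
    "https://www.instagram.com/p/ChLBudQlN3D/"
  ),
  stringsAsFactors = FALSE
)

checked <- dataset %>%
  mutate(
    # normalize the platform name: drop spaces, lowercase
    platform_r = tolower(str_replace_all(platform, " ", "")),
    # str_detect() pairs each url with the pattern in the same row
    url_test = str_detect(url, platform_r)
  )

# records where the url does not contain the platform name
mismatches <- checked %>%
  filter(!url_test) %>%
  select(platform, url, url_test)

# Base R alternative: mapply() calls grepl() once per (pattern, url) pair
checked$url_test_base <- mapply(
  grepl, checked$platform_r, checked$url,
  MoreArgs = list(fixed = TRUE), USE.NAMES = FALSE
)

print(mismatches)
```

Note that str_detect treats the pattern as a regular expression, so a platform name containing characters such as "." or "+" would need to be wrapped in fixed() to be matched literally.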