
How do I find differences between similar strings?


I have a vector of strings (file names to be exact).

pav <- c("Sn_4Khz_3W_45_130_02_30cm_101mm_",
         "Sn_4Khz_4W_45_130_02_30cm_101mm_",
         "Sn_4Khz_4W_50_130_02_30cm_101mm_")

I'm looking for a simple way to find the differences between these strings, ideally with functions that behave like this:

> char_position_fun(pav) # gives the positions where the strings differ
[1] 9 12 13


> char_diff_fun(pav) # drops the matching characters and keeps only the differing ones
[1] 3_4_5  4_4_5  4_5_0

Solution

  • Here is my attempt. I split each string into individual characters and build one data frame that records, for every string (id), each character (word) and its position. Then, for each position, I check whether all strings share the same character (n_distinct(word) == 1). If check is FALSE, the strings differ at that position. Finally, I keep only the rows where check is FALSE, so the position and the character are shown together. (mytext is defined under DATA at the end.)

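    library(tidyverse)
    
    # Split each string into single characters (one list element per string)
    strsplit(mytext, split = "") %>%
    # Stack into one data frame: id = string index, position = character index
    map_dfr(.x = .,
            .f = function(x) enframe(x, name = "position", value = "word"),
            .id = "id") %>%
    group_by(position) %>%
    # check is TRUE when every string has the same character at this position
    mutate(check = n_distinct(word) == 1) %>%
    # Keep only the positions where the strings differ
    filter(check == FALSE)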
    
      id    position word  check
      <chr>    <int> <chr> <lgl>
    1 1            9 3     FALSE
    2 1           12 4     FALSE
    3 1           13 5     FALSE
    4 2            9 4     FALSE
    5 2           12 4     FALSE
    6 2           13 5     FALSE
    7 3            9 4     FALSE
    8 3           12 5     FALSE
    9 3           13 0     FALSE
    

    If you want the outcome in the form you described, you can add a few more steps.

    strsplit(mytext, split = "") %>%
    map_dfr(.x = .,
            .f = function(x) enframe(x, name = "position", value = "word"),
            .id = "id") %>%
    group_by(position) %>%
    mutate(check = n_distinct(word) == 1) %>%
    filter(check == FALSE) %>%
    # Regroup by string and collapse the differing positions and characters
    group_by(id) %>%
    summarize_at(vars(position:word),
                 .funs = list(~paste0(., collapse = "_")))
    
      id    position word 
      <chr> <chr>    <chr>
    1 1     9_12_13  3_4_5
    2 2     9_12_13  4_4_5
    3 3     9_12_13  4_5_0
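
For comparison, the two helpers named in the question could be sketched in base R. This is only a minimal sketch and not the approach used in the answer above: char_position_fun() and char_diff_fun() are the hypothetical function names from the question, and the code assumes every string has the same number of characters, so that positions line up.

```r
# Sample data copied from the question
pav <- c("Sn_4Khz_3W_45_130_02_30cm_101mm_",
         "Sn_4Khz_4W_45_130_02_30cm_101mm_",
         "Sn_4Khz_4W_50_130_02_30cm_101mm_")

# Helper: character matrix with one row per string, one column per position.
# Assumption: all strings have equal length; otherwise we stop early.
split_chars <- function(x) {
  stopifnot(length(unique(nchar(x))) == 1)
  do.call(rbind, strsplit(x, split = ""))
}

# Hypothetical name from the question: positions where the strings differ
char_position_fun <- function(x) {
  chars <- split_chars(x)
  # A column with more than one distinct character marks a difference
  which(apply(chars, 2, function(col) length(unique(col)) > 1))
}

# Hypothetical name from the question: differing characters per string
char_diff_fun <- function(x) {
  chars <- split_chars(x)
  pos <- char_position_fun(x)
  # drop = FALSE keeps a matrix even when only one position differs
  apply(chars[, pos, drop = FALSE], 1, paste, collapse = "_")
}

char_position_fun(pav)
char_diff_fun(pav)
```

With the sample data, the first call should return the positions 9, 12 and 13, and the second the strings "3_4_5", "4_4_5" and "4_5_0", which matches the tidyverse result above. Strings of unequal length would need padding or alignment first, which this sketch does not attempt.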
    

    DATA

    mytext <- c("Sn_4Khz_3W_45_130_02_30cm_101mm_", "Sn_4Khz_4W_45_130_02_30cm_101mm_", 
    "Sn_4Khz_4W_50_130_02_30cm_101mm_")