r nlp tidyverse

Fill in word that letter is located in


I am processing keystroke data and need to find the word that each keystroke is located within. Because there can be invisible keystrokes (like Shift) or deleted keystrokes, I can't simply iterate over the keystroke index to locate the word. Rather, I need to find the space-delimited word that the keystroke is produced within. I do have the full text and the existing text available, which I should be able to leverage. I've tried solutions using fill(), lag(), and cumsum(), but none of them work.

I have a dataframe like the below, where I group by experiment_id:

x <- tibble(
  experiment_id = rep(c('1a','1b'),each=12),
  keystroke = rep(c('a','SPACE','SHIFT','b','e','DELETE','a','d','SPACE','m','a','n'),2),
  existing_text = rep(c('a','a ','a ','a B','a Be','a B','a Ba','a Bad','a Bad ',
                    'a Bad m','a Bad ma','a Bad man'),2),
  final_text = 'a Bad man'
)

The additional column should look like this, where SPACE belongs to the word it follows, and DELETEs and the deleted keystrokes are part of the final word:

within_word = c('a','a','BeDELETEad','BeDELETEad','BeDELETEad','BeDELETEad','BeDELETEad','BeDELETEad','BeDELETEad','man','man','man')

Is there a way to derive this?

EDIT FOR ADDITIONAL HELP: In the comments below the answer, @Onyambu mentioned that there is a simpler solution using the keystroke column. I've found that in my larger, more complex data that existing_text is not always reliable. I would strongly prefer a solution that relies on keystroke primarily. I've also added in complications due to deletions.


Solution

  • Below are two approaches:

    The first uses the information in existing_text only for the grouping and constructs the within_word column based on this grouping and keystroke.

    The second approach uses only the information in keystroke.


    First approach: grouping based on existing_text and content based on keystroke:

    We take three steps:

    First, we calculate the grouping based on strsplit, where we split on spaces \\s that are preceded by a word character \\w. We need to correct the values for "SHIFT", since they should be counted toward the word after "SPACE".

    Second, we replace "SHIFT" (and all other similar keys, which the example data doesn't contain) as well as "SPACE" with "".

    Third, we collapse the strings with paste0(..., collapse = "").
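
    To make step one concrete, here is a minimal base-R sketch of the grouping logic on a shortened version of the example data. The vectors and the names `nxt` and `word_grp_fixed` are illustrative assumptions, not part of the pipeline below, and `ifelse()` with a manually shifted vector stands in for `if_else()` and `lead()`.

```r
# Shortened, illustrative version of the example data
existing_text <- c("a", "a ", "a ", "a B", "a Bad ", "a Bad m")
keystroke     <- c("a", "SPACE", "SHIFT", "b", "SPACE", "m")

# Number of space-delimited words typed so far;
# a trailing space does not start a new word yet
word_grp <- lengths(strsplit(existing_text, "(?<=\\w)\\s", perl = TRUE))
word_grp
#> [1] 1 1 1 2 2 3

# SHIFT leaves existing_text unchanged, so it borrows the group of the next row
nxt <- c(word_grp[-1], word_grp[length(word_grp)])
word_grp_fixed <- ifelse(keystroke == "SHIFT", nxt, word_grp)
word_grp_fixed
#> [1] 1 1 2 2 2 3
```

    The SHIFT row moves from group 1 to group 2, which is what places it in the word that follows the space.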

    library(tidyverse)
    
    x %>%
    
      # step1: construct grouping:
      mutate(word_grp = lengths(strsplit(existing_text, "(?<=\\w)\\s", perl = TRUE)) %>% 
               if_else(keystroke == "SHIFT", lead(., default = last(.)), .)) %>%
      group_by(experiment_id, word_grp) %>% 
    
      # step 2 & 3: first replace keys like "SHIFT" with "", then collapse with `paste0`
      mutate(within_word = str_replace_all(keystroke, c("SHIFT" = "", "SPACE" = "")) %>% 
               paste0(., collapse = ""))
    
    #> # A tibble: 24 x 6
    #> # Groups:   experiment_id, word_grp [6]
    #>    experiment_id keystroke existing_text final_text word_grp within_word
    #>    <chr>         <chr>     <chr>         <chr>         <int> <chr>      
    #>  1 1a            a         "a"           a Bad man         1 a          
    #>  2 1a            SPACE     "a "          a Bad man         1 a          
    #>  3 1a            SHIFT     "a "          a Bad man         2 beDELETEad 
    #>  4 1a            b         "a B"         a Bad man         2 beDELETEad 
    #>  5 1a            e         "a Be"        a Bad man         2 beDELETEad 
    #>  6 1a            DELETE    "a B"         a Bad man         2 beDELETEad 
    #>  7 1a            a         "a Ba"        a Bad man         2 beDELETEad 
    #>  8 1a            d         "a Bad"       a Bad man         2 beDELETEad 
    #>  9 1a            SPACE     "a Bad "      a Bad man         2 beDELETEad 
    #> 10 1a            m         "a Bad m"     a Bad man         3 man        
    #> # … with 14 more rows
    


    Second approach: based on information in keystrokes only.

    This approach uses only the information in keystroke, which makes things considerably more laborious.

    Here is a short explanation of the steps taken below:

    Step 1a: data cleaning
    We need to clean the data in keystroke so that it can be used for the new column within_word. This means two things: (a) we need to replace every keystroke that should not be printed in within_word with "", and, before that, (b) we need to modify the keystroke that follows such a key according to the key's function. For SHIFT, this means converting the following keystroke with toupper. For your example data this is simple, because SHIFT is the only key we need to take care of. However, your real data might contain many similar keys such as ALT or ^, so we need to repeat Step 1a for each of them. Ideally we would write a function that takes the name of the key and the function it applies to the following keystroke. Note that we do not yet include "SPACE" in this step, since we need it in Step 2.
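
    As a rough sketch of what such a function could look like, the hypothetical helper `apply_modifier()` below (the name and signature are assumptions, not part of the answer's code) applies a function to every keystroke that directly follows a given key and then blanks out the key itself. It uses base R only and matches whole keystrokes exactly, which differs slightly from the `str_replace_all()` call used further down.

```r
# Hypothetical helper: apply `fn` to each keystroke that directly follows
# `key`, then replace `key` itself with "".
apply_modifier <- function(keys, key, fn) {
  prev <- c("", keys)[seq_along(keys)]        # previous keystroke, "" for the first
  keys <- ifelse(prev == key, fn(keys), keys) # modify the keystroke after `key`
  ifelse(keys == key, "", keys)               # drop the modifier key itself
}

keys <- c("a", "SPACE", "SHIFT", "b", "e", "DELETE", "a", "d")
apply_modifier(keys, "SHIFT", toupper)
#> [1] "a"      "SPACE"  ""       "B"      "e"      "DELETE" "a"      "d"
```

    Because the helper takes and returns a plain character vector, it could be called once per modifier key inside `mutate()`.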

    To see how many keys you need to take care of in your actual data we can filter for those keystrokes that don't change the existing_text. In your example data this is only SHIFT:

    # get all keystrokes that don't change the existing_text directly
    x %>% 
      select(keystroke, existing_text) %>% 
      filter(existing_text == lag(existing_text, default = ""))
    
    #> # A tibble: 2 x 2
    #>   keystroke existing_text
    #>   <chr>     <chr>        
    #> 1 SHIFT     "a "         
    #> 2 SHIFT     "a "
    

    Step 2: create grouping
    We need to create the grouping of the words in within_word. This is the most complicated step. Below we first look for rows where within_word == "SPACE" and whose succeeding row is != "SPACE". We use data.table::rleid on the result to get a run-length id for this variable. Finally, we need to subtract 1 for those rows where within_word == "SPACE", so that each SPACE is grouped with the word it follows.
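
    The intermediate values of this step can be traced on a single experiment. The sketch below assumes the keystrokes have already been cleaned as in step 1a and, to stay dependency-free, uses base R's `rle()` as a stand-in for `data.table::rleid()`; the names `nxt`, `word_end` and `run_id` are illustrative.

```r
# Cleaned keystrokes of one experiment (SHIFT already replaced by "")
within_word <- c("a", "SPACE", "", "B", "e", "DELETE",
                 "a", "d", "SPACE", "m", "a", "n")

# TRUE for a SPACE that is not followed by another SPACE
nxt <- c(within_word[-1], "")  # next keystroke; "" after the last one
word_end <- within_word == "SPACE" & nxt != "SPACE"

# Number the runs of equal values (what rleid() does)
runs <- rle(word_end)
run_id <- rep(seq_along(runs$lengths), runs$lengths)
run_id
#>  [1] 1 2 3 3 3 3 3 3 4 5 5 5

# Each SPACE forms its own run, so shift it back to the word before it
word_grp <- ifelse(within_word == "SPACE", run_id - 1L, run_id)
word_grp
#>  [1] 1 1 3 3 3 3 3 3 3 5 5 5
```

    The group ids are not consecutive (1, 3, 5), which does not matter because they are only used for grouping.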

    Step 3: data prep before final step
    This is similar to step 1a: we need to replace "SPACE" with "" because we don't want it in our result. However, since we needed "SPACE" for step 2, we can only finalize the data cleaning in this step.

    Step 4: collapse the strings in within_word
    Finally, we group by experiment_id and by word_grp and collapse the strings in within_word with paste0(..., collapse = "").
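
    The collapse keeps one row per keystroke and repeats the full word on each of them. A small base-R sketch of the same idea, using `ave()` in place of `group_by()` plus `mutate()` and assuming the cleaned values and group ids of a single experiment:

```r
# Cleaned keystrokes and group ids of one experiment (SPACE and SHIFT are "")
within_word <- c("a", "", "", "B", "e", "DELETE", "a", "d", "", "m", "a", "n")
word_grp    <- c(1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 5L, 5L, 5L)

# Paste all keystrokes of a group together and repeat the result on every row
collapsed <- ave(within_word, word_grp,
                 FUN = function(s) paste0(s, collapse = ""))
unique(collapsed)
#> [1] "a"          "BeDELETEad" "man"
```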

    library(tidyverse)
    
    x %>%
    
      # step 1a: data cleaning
      mutate(within_word = if_else(lag(keystroke, default = first(keystroke)) == "SHIFT",
                                   toupper(keystroke),
                                   keystroke) %>%
                              str_replace_all(., c("SHIFT" = ""))) %>%  
     
      # step 1b to 1n: repeat step 1a for other keys like ALT, ^ etc. 
    
      # step 2: create groups
      group_by(experiment_id) %>% 
      mutate(word_grp = data.table::rleid(
          within_word == "SPACE" & lead(within_word, default = first(keystroke)) != "SPACE"
        ) %>% if_else(within_word == "SPACE", . - 1L, .)) %>% 
    
      # step 3: data prep before final step
      ungroup() %>% 
      mutate(within_word = str_replace(within_word, "SPACE", "")) %>%
     
      # step 4: collapse
      group_by(experiment_id, word_grp) %>% 
      mutate(within_word = paste0(within_word, collapse = ""))
    
    #> # A tibble: 24 x 6
    #> # Groups:   experiment_id, word_grp [6]
    #>    experiment_id keystroke existing_text final_text within_word word_grp
    #>    <chr>         <chr>     <chr>         <chr>      <chr>          <int>
    #>  1 1a            a         "a"           a Bad man  a                  1
    #>  2 1a            SPACE     "a "          a Bad man  a                  1
    #>  3 1a            SHIFT     "a "          a Bad man  BeDELETEad         3
    #>  4 1a            b         "a B"         a Bad man  BeDELETEad         3
    #>  5 1a            e         "a Be"        a Bad man  BeDELETEad         3
    #>  6 1a            DELETE    "a B"         a Bad man  BeDELETEad         3
    #>  7 1a            a         "a Ba"        a Bad man  BeDELETEad         3
    #>  8 1a            d         "a Bad"       a Bad man  BeDELETEad         3
    #>  9 1a            SPACE     "a Bad "      a Bad man  BeDELETEad         3
    #> 10 1a            m         "a Bad m"     a Bad man  man                5
    #> # … with 14 more rows
    

    Created on 2021-12-23 by the reprex package (v0.3.0)