
Substring evaluation of word components in R for NLP


I'm trying to do some string evaluation on given words so that the output is a list of each word's components as overlapping two-letter combinations.

For example:

'House' becomes 'ho','ou','us','se'

Producing this outcome is relatively easy using 'substr' as below:

y <- 'house'

substr(y, start = 1, stop = 2)
substr(y, start = 2, stop = 3)
substr(y, start = 3, stop = 4)
substr(y, start = 4, stop = 5)

What I would like to do, however, is generalise this so that a word of any length is split into its component two-letter combinations.

So 'Motorcar' becomes 'mo','ot','to','or','rc','ca','ar', and so on.

Is there a way to do this using loops or a function? Does the length of the word need to be a condition of the function?

Any thoughts greatly appreciated.


Solution

  • We can use substring, which is vectorised over its start and stop positions, so no explicit loop is needed:

    get_string <- function(x) {
       # character positions 1..n of the word
       inds <- seq_len(nchar(x))
       # each pair starts at 1..(n-1) and stops at 2..n
       start <- inds[-length(inds)]
       stop <- inds[-1]
       substring(x, start, stop)
    }
    
    get_string('House')
    #[1] "Ho" "ou" "us" "se"
    
    get_string('Motorcar')
    #[1] "Mo" "ot" "to" "or" "rc" "ca" "ar"
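
The question asks for lowercase output and whether the word length has to be a condition. As a minimal sketch (not part of the original answer), the same idea can be wrapped with a length check and an optional `tolower` step. The name `get_bigrams` and the `lower` argument are hypothetical, introduced only for illustration, and the function assumes a single word is passed in.

```r
# Hypothetical helper (an assumption, not from the original answer):
# returns the overlapping two-letter pieces of a single word.
get_bigrams <- function(word, lower = TRUE) {
  stopifnot(is.character(word), length(word) == 1L)
  if (lower) word <- tolower(word)
  n <- nchar(word)
  # Words shorter than two characters have no two-letter pieces
  if (is.na(n) || n < 2L) return(character(0))
  # Start positions 1..(n-1); each piece ends one character later
  starts <- seq_len(n - 1L)
  substring(word, starts, starts + 1L)
}

get_bigrams("House")
#[1] "ho" "ou" "us" "se"

get_bigrams("Motorcar")
#[1] "mo" "ot" "to" "or" "rc" "ca" "ar"

# Several words at once: lapply returns one character vector per word
lapply(c("house", "car"), get_bigrams)
```

So the length does matter, but only through `nchar`, which fixes how many start positions there are. The explicit `n < 2L` guard is a design choice for one-letter or empty input, where there are no pairs to return.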