
R programming strsplit(): Undesired result


I want to split a string into words, and I am following Example 1:

Example 1:

> x <- "Split the words in a sentence."
> strsplit(x, " ")

[[1]]
[1] "Split"     "the"       "words"     "in"       
[5] "a"         "sentence."

So I am trying to split NewString the same way:

> NewString
[1] "s14 v13 s13 s13 v12 s12 v11 s11 v10 s10 s10 v09 s09 v08 s08 v07 s07 v06 s06 v05 s05 v04 s04 v03 s03 v02 s02 s01 v00 "
> strsplit(NewString,' ')
 [[1]]
 [1] "s14 v13 s13 s13 v12 s12 v11 s11 v10 s10 s10 v09 s09 v08 s08 v07 s07 v06 s06 v05 s05 v04 s04 v03 s03 v02 s02 s01 v00 "

The function does not split the text. The strange thing is that if I copy the printed output of NewString and paste it into strsplit(), it works:

 >strsplit("s14 v13 s13 s13 v12 s12 v11 s11 v10 s10 s10 v09 s09 v08 s08 v07 s07 v06 s06 v05 s05 v04 s04 v03 s03 v02 s02 s01 v00 ",' ')
 [[1]]
 [1] "s14" "v13" "s13" "s13" "v12" "s12" "v11" "s11" "v10" "s10" "s10" "v09" "s09"
 [14] "v08" "s08" "v07" "s07" "v06" "s06" "v05" "s05" "v04" "s04" "v03" "s03" "v02"
 [27] "s02" "s01" "v00"

What could be the problem?

(NewString was produced using the rvest package.)

Edit: charToRaw() gives the following output:

> charToRaw(lol)
 [1] 73 31 34 c2 a0 76 31 33 c2 a0 73 31 33 c2 a0 73 31 33 c2 a0 76 31 32 c2 a0
 [26] 73 31 32 c2 a0 76 31 31 c2 a0 73 31 31 c2 a0 76 31 30 c2 a0 73 31 30 c2 a0
 [51] 73 31 30 c2 a0 76 30 39 c2 a0 73 30 39 c2 a0 76 30 38 c2 a0 73 30 38 c2 a0
 [76] 76 30 37 c2 a0 73 30 37 c2 a0 76 30 36 c2 a0 73 30 36 c2 a0 76 30 35 c2 a0
[101] 73 30 35 c2 a0 76 30 34 c2 a0 73 30 34 c2 a0 76 30 33 c2 a0 73 30 33 c2 a0
[126] 76 30 32 c2 a0 73 30 32 c2 a0 73 30 31 c2 a0 76 30 30 c2 a0

Solution

  • This can be done using the stringi package and stri_split.

    First, let's make a string separated by the same bytes (194, 160 is C2 A0 in hex, which is the UTF-8 encoding of U+00A0, the non-breaking space):

    s <- rawToChar(as.raw(c(65, 66, 48, 194, 160, 65, 67, 49, 194, 160, 65, 68, 50)))
    
    > s
    [1] "AB0 AC1 AD2"
    

    Ordinary str_split doesn't work:

    > str_split(s,"\\s+")
    [[1]]
    [1] "AB0 AC1 AD2"
    

    But with stringi installed and loaded, stri_split does:

    > stri_split(s,regex="\\s+")
    [[1]]
    [1] "AB0" "AC1" "AD2"
    

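    I suspect stringi has a wider concept of what whitespace (\s) is: it is built on ICU, whose regex \s covers Unicode space separators such as the non-breaking space U+00A0, not just the ASCII space and tab.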
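
If adding a package is not an option, the same idea can be sketched in base R by naming the separator explicitly. This is a minimal sketch, not part of the original answer: the string `x` below is a shortened, made-up stand-in for NewString, and it assumes the separators are U+00A0 as the `c2 a0` pairs in the charToRaw() output suggest (with rvest these most likely come from `&nbsp;` entities in the scraped HTML).

```r
# Hypothetical stand-in for NewString: tokens separated by U+00A0
# (non-breaking space), which prints like an ordinary space.
x <- "s14\u00a0v13\u00a0s13\u00a0"

# Diagnosis: code point 160 (0xA0) is present, code point 32 (space) is not.
has_nbsp  <- 160L %in% utf8ToInt(x)   # TRUE
has_space <- 32L %in% utf8ToInt(x)    # FALSE

# Splitting on an ordinary space leaves the string in one piece.
plain <- strsplit(x, " ", fixed = TRUE)[[1]]

# Option 1: split on the non-breaking space itself.
parts <- strsplit(x, "\u00a0", fixed = TRUE)[[1]]
print(parts)
# "s14" "v13" "s13"

# Option 2: normalise to ordinary spaces first, then split as usual.
normalised <- gsub("\u00a0", " ", x, fixed = TRUE)
parts2 <- strsplit(normalised, " ", fixed = TRUE)[[1]]
```

This also explains why pasting the printed output worked: the copy and paste step turned the non-breaking spaces into ordinary spaces, so the pasted literal was no longer the same string as NewString.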