dplyr, apache-spark-sql, sparklyr

Count the number of characters of the first, second and third word in a string


I need to develop code that counts the number of characters in the second and third words of a string.

I have this code, but it only works for the number of characters in the first word.

I am only allowed to use Spark SQL or the dplyr package.

This is what I wrote to count the characters in the first word:

INSTR(NAME_NORM_LONG,' ')-1

The expected result is the character count of each word, displayed in new columns.

word="hey I am Scott"

characters_word1 | characters_word2 | characters_word3
               3 |                1 |                2

Now I am running this code for testing (trying to locate the second word):

test_query <- test_query %>%
  mutate(Total_char = nchar(NAME_NORM_LONG)) %>%
  mutate(Name_has_numbers = str_detect(NAME_NORM_LONG, "[[:digit:]]")) %>%
  mutate(number_words = LENGTH(NAME_NORM_LONG) - LENGTH(REPLACE(NAME_NORM_LONG, ' ', '')) + 1) %>%
  mutate(number_chars_w1 = INSTR(NAME_NORM_LONG, ' ') - 1) %>%
  mutate(number_chars_w2 = substr(NAME_NORM_LONG, number_chars_w1 + 1, LENGTH(NAME_NORM_LONG)))

The result I get is:

test_query
# Source: spark<?> [?? x 7]
   PBIN0 NAME_NORM_LONG Total_char Name_has_numbers number_words number_chars_w1
   <int> <chr>               <int> <lgl>                   <dbl>           <dbl>
1 4.01e8 GM BUILDERS            11 FALSE                       2               2
# … with 1 more variable: number_chars_w2 <chr>
Warning messages:
1: In substr(NAME_NORM_LONG, number_chars_w1, 1) :
  NAs introduced by coercion
2: In substr(NAME_NORM_LONG, number_chars_w1, 1) :
  NAs introduced by coercion
3: In substr(NAME_NORM_LONG, number_chars_w1, 1) :
  NAs introduced by coercion
4: In substr(NAME_NORM_LONG, number_chars_w1, 1) :
  NAs introduced by coercion
5: In substr(NAME_NORM_LONG, number_chars_w1, 1) :
  NAs introduced by coercion

Solution

  • How about using str_split()?

    word <- "hey I am Scott"

    # str_split() returns a list with one character vector per input string
    word_list <- stringr::str_split(word, " ")

    # One label and one character count per word (no loop needed)
    n <- length(word_list[[1]])
    first_row <- paste0("characters_word", 1:n)
    second_row <- nchar(word_list[[1]])

    df <- data.frame(first_row, second_row)
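
The answer above yields one row per word, while the question asks for one column per word. Below is a minimal sketch of that wide layout in base R, assuming words are separated by whitespace. The names `count_word_chars` and `n_words` are hypothetical, introduced here for illustration and not part of the original answer. It runs on local R vectors only and does not touch a Spark table.

```r
# Hypothetical helper (an assumption, not from the original answer):
# counts the characters of the first `n_words` words of each string
# and returns one column per word.
count_word_chars <- function(x, n_words = 3) {
  # Trim the ends, then split each string on runs of whitespace
  words <- strsplit(trimws(x), "[[:space:]]+")
  # Take the lengths of the first n_words words of each string;
  # strings with fewer words get NA in the missing positions
  counts <- t(vapply(words,
                     function(w) nchar(w)[seq_len(n_words)],
                     integer(n_words)))
  colnames(counts) <- paste0("characters_word", seq_len(n_words))
  data.frame(word = x, counts, stringsAsFactors = FALSE)
}

result <- count_word_chars(c("hey I am Scott", "GM BUILDERS"))
print(result)
```

For "hey I am Scott" this gives 3, 1 and 2; for "GM BUILDERS" the third column is NA because there is no third word. On a Spark table, the same idea could probably be expressed with Spark SQL's built-in `split`, `element_at` and `length` functions, for example `length(element_at(split(NAME_NORM_LONG, ' '), 2))` for the second word. That expression is untested here, and how an out-of-range index behaves depends on the Spark version and settings.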