r, string, dataframe, text, frequency

Counting overall word frequency when each sentence is a separate row in a dataframe


I have a single dataframe column that contains people's names. Some names have only one word (i.e. a first name). Some names have two words, i.e. a first name and a last name separated by a space. Some names have three words: first, middle and last names separated by spaces. E.g.:

Luke
Luke Skywalker
Walk Sky Luker
Walk Luke Syker 

A few names have four or more words. I want to find the frequency of each individual word e.g.

Luke 3
Walk 2
Sky 1
Skywalker 1
Luker 1
Syker 1

How can I implement this in R? I have tried extracting words using stringr. I can separate words when they are in a single block of text such as a paragraph, but I am unable to separate them when each name is in a separate row of a data frame. Any help?


Solution

  • You can call table() on the unlisted result of strsplit() applied to your column:

    table(unlist(strsplit(df$Words, " ")))
    
    # Luke     Luker       Sky Skywalker     Syker      Walk 
    #    3         1         1         1         1         2 
    

    and if you need it sorted by frequency:

    sort(table(unlist(strsplit(df$Words, " "))), decreasing = TRUE)
    
    #     Luke      Walk     Luker       Sky Skywalker     Syker 
    #        3         2         1         1         1         1 
    

    where df$Words is your column of interest.
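
For a reproducible version of the approach above, here is a minimal sketch. It rebuilds the names from the question in a data frame and assumes the column is called `Words`, as in the answer. The `trimws()` call, the `"\\s+"` split pattern, and the `freq_df`, `word` and `n` names are additions for illustration, not part of the original answer; they are meant to guard against stray or repeated spaces and to give a result that is easier to work with than a table.

```r
# Sample data rebuilt from the question; the column name `Words` is an assumption.
# stringsAsFactors = FALSE keeps the column as character, which strsplit() requires.
df <- data.frame(
  Words = c("Luke", "Luke Skywalker", "Walk Sky Luker", "Walk Luke Syker "),
  stringsAsFactors = FALSE
)

# Trim leading/trailing whitespace, then split on runs of whitespace,
# so extra spaces do not produce empty tokens.
words <- unlist(strsplit(trimws(df$Words), "\\s+"))

# Count each word and sort by descending frequency.
freq <- sort(table(words), decreasing = TRUE)

# Optional: convert the table to a two-column data frame.
freq_df <- as.data.frame(freq, stringsAsFactors = FALSE)
names(freq_df) <- c("word", "n")  # illustrative column names
freq_df
```

If the column was read in as a factor, wrapping it in `as.character()` before `strsplit()` should avoid the "non-character argument" error.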