Screen Names from Twitter into DataFrame - R


I am downloading all the tweets (using the rtweet package, version 0.7.0) whose text mentions the user @sernac (a Chilean government entity), and then extracting all the usernames (screen names) from the body of each tweet with the following code.

library(rtweet)   # search_tweets()
library(stringr)  # str_extract_all()
Tweets <- search_tweets("@sernac", n = 50000, include_rts = FALSE)
Names <- str_extract_all(Tweets$text, "(?<=^|\\s)@[^\\s]+")

This gives me a list object with the screen names found in the text of each tweet.

The first question is: how do I get a data frame with the following structure?

X1         X2           X3            X4         X5         ...   Xn
@sernac    @vtrchile    NA            NA         NA         NA    NA
@username  @playstation @taylorswitft @elonmusk  @instagram NA    NA
@username2 @username5   @selenagomez  @username2 @username3 @FIFA @xbox
@username4 @ebay        NA            NA         NA         NA    NA

The number of columns should equal the maximum number of elements in any object of the list.

I tried the following function, but it only returns 4 columns, whereas the largest object in the list has 9 elements.

df <- data.frame(matrix(unlist(Names), nrow=length(Names), byrow = T))

After this, I need to perform a left join between this table and a cluster table that I created. The join should first match the first column of the newly created data frame against the cluster data frame. If there is no match, it should perform a second left join using the second column, and so on, until a match is found or all the columns are exhausted.

This is an example of the cluster data frame I created and the final desired result:

CLUSTER DATA FRAME

screen_name  cluster
@sernac      Gov
@playstation Videogames
@walmart     Supermarket
@SelenaGomez Celebrity
@elonmusk    Celebrity
@xbox        Videogames
@ebay        Ecommerce

FINAL RESULT

X1         X2           X3            X4         X5         ...   Xn    cluster
@sernac    @vtrchile    NA            NA         NA         NA    NA    Gov
@username  @playstation @taylorswitft @elonmusk  @instagram NA    NA    Videogames
@username2 @username5   @selenagomez  @username2 @username3 @FIFA @xbox Celebrity
@username4 @ebay        NA            NA         NA         NA    NA    Ecommerce

I have tried to explain myself as well as I can. English is not my main language, so I can add more detail in the comments.


Solution

  • I would approach this differently.

    First, if you are trying to download as many tweets as possible, set n = Inf and retryonratelimit = TRUE:

    Tweets <-  search_tweets("@sernac", 
                             n = Inf, 
                             include_rts = FALSE, 
                             retryonratelimit = TRUE)
    

    Second, there is no need to extract screen names from the tweet text, as this information can be found in the entities column.

    One way to extract mentions is to use lapply. You can then create a data frame with just the useful columns, and convert screen names to lower case for matching.

    library(dplyr)
    
    mentions <- lapply(Tweets$entities, function(x) x$user_mentions) %>%
      bind_rows(.id = "tweet_number") %>%
      select(tweet_number, screen_name) %>%
      mutate(screen_name_lc = tolower(screen_name))
    
    head(mentions)
    
      tweet_number    screen_name screen_name_lc
    1            1 mundo_pacifico mundo_pacifico
    2            1       OIMChile       oimchile
    3            1   subtel_chile   subtel_chile
    4            1 ReclamosSubtel reclamossubtel
    5            1         SERNAC         sernac
    6            2 mundo_pacifico mundo_pacifico
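
    If Tweets has no entities column, the same long table can probably be built another way. The question mentions rtweet 0.7.0, and as far as I recall that version stores mentions in a list column called mentions_screen_name; treat that column name as an assumption and check names(Tweets) first. The sketch below uses a made-up two-row stand-in (Tweets_old) rather than real API output:

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for Tweets; mentions_screen_name is ASSUMED to be
# a list column holding one character vector of handles per tweet
Tweets_old <- tibble::tibble(
  text = c("first tweet", "second tweet"),
  mentions_screen_name = list(c("SERNAC", "OIMChile"), "mundo_pacifico")
)

mentions_old <- Tweets_old %>%
  mutate(tweet_number = as.character(row_number())) %>%  # same role as .id above
  select(tweet_number, screen_name = mentions_screen_name) %>%
  unnest(screen_name) %>%                                # one row per mention
  mutate(screen_name_lc = tolower(screen_name))
```

    The result has the same three columns as mentions, so the remaining steps should apply unchanged.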
    

    Next, add a column with the lower-case screen names to your cluster data:

    library(stringr)  # provides str_replace()
    
    cluster_df <- cluster_df %>% 
      mutate(screen_name_lc = str_replace(screen_name, "@", "") %>% 
             tolower())
    

    Now we can join the data frames, just on the screen_name_lc column:

    mentions_clusters <- mentions %>% 
      left_join(cluster_df, 
                by = "screen_name_lc") %>% 
      select(tweet_number, screen_name = screen_name.x, cluster)
    
    head(mentions_clusters)
    
      tweet_number    screen_name cluster
    1            1 mundo_pacifico    <NA>
    2            1       OIMChile    <NA>
    3            1   subtel_chile    <NA>
    4            1 ReclamosSubtel    <NA>
    5            1         SERNAC     Gov
    6            2 mundo_pacifico    <NA>
    

    This "long" format is much easier to work with for subsequent analysis than the "wide" format, and can still be grouped by tweet using the tweet_number column.
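
    The "try column 1, then column 2, and so on" rule from the question amounts to taking the first mention in each tweet that has a cluster. A minimal sketch of that step follows, using a small made-up stand-in for mentions_clusters; the names tweet_cluster, position and wide are only illustrative. The second half reshapes the result into the X1..Xn layout with tidyr, in case the wide table is still wanted:

```r
library(dplyr)
library(tidyr)

# Made-up rows with the same columns as mentions_clusters
mentions_clusters <- data.frame(
  tweet_number = c("1", "1", "2", "2", "2", "3"),
  screen_name  = c("vtrchile", "SERNAC", "username",
                   "playstation", "elonmusk", "username4"),
  cluster      = c(NA, "Gov", NA, "Videogames", "Celebrity", NA),
  stringsAsFactors = FALSE
)

# One cluster per tweet: the first mention, in order, that matched cluster_df.
# Indexing an empty vector with [1] gives NA, so unmatched tweets stay NA.
tweet_cluster <- mentions_clusters %>%
  group_by(tweet_number) %>%
  summarise(cluster = cluster[!is.na(cluster)][1], .groups = "drop")

# Optional: spread the mentions into X1..Xn columns (missing cells become NA)
wide <- mentions_clusters %>%
  group_by(tweet_number) %>%
  mutate(position = paste0("X", row_number())) %>%
  ungroup() %>%
  pivot_wider(id_cols = tweet_number,
              names_from = position,
              values_from = screen_name) %>%
  left_join(tweet_cluster, by = "tweet_number")
```

    Here tweet 2 mentions both a Videogames and a Celebrity account, and the earlier mention wins, which matches the desired result shown in the question.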

    Data for cluster_df:

    cluster_df <- structure(list(screen_name = c("@sernac", "@playstation", "@walmart", 
    "@SelenaGomez", "@elonmusk", "@xbox", "@ebay"), cluster = c("Gov", 
    "Videogames", "Supermarket", "Celebrity", "Celebrity", "Videogames", 
    "Ecommerce"), screen_name_lc = c("sernac", "playstation", "walmart", 
    "selenagomez", "elonmusk", "xbox", "ebay")), class = "data.frame", row.names = c(NA, 
    -7L))
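
    On the first question: unlist() flattens the list and discards the boundaries between tweets, and matrix() called with only nrow derives the number of columns from the total length, which is presumably why the attempt produced 4 columns instead of 9. If the wide table built directly from the Names list is still needed, a base R sketch is to pad every element with NA to the longest length before binding. The Names list below is a made-up stand-in, and wide_names is just an illustrative name:

```r
# Made-up stand-in for the list returned by str_extract_all()
Names <- list(
  c("@sernac", "@vtrchile"),
  c("@username", "@playstation", "@elonmusk"),
  character(0)   # a tweet in which the pattern found nothing
)

n_max <- max(lengths(Names))

# Assigning a longer length pads the vector with NA up to n_max
padded <- lapply(Names, function(x) {
  length(x) <- n_max
  x
})

# One row per tweet, one column per mention position
wide_names <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
names(wide_names) <- paste0("X", seq_len(n_max))
```

    Each row keeps its own mentions in order, and shorter rows are filled with NA on the right.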