Screen Names from Twitter into DataFrame - R

I am downloading all tweets that mention the user @sernac (a Chilean government entity) in their text, using the rtweet package (version 0.7.0). I then extract all the usernames (screen names) from the body of each tweet with the following code.

library(rtweet)
library(stringr)
Tweets <- search_tweets("@sernac", n = 50000, include_rts = FALSE)
Names <- str_extract_all(Tweets$text, "(?<=^|\\s)@[^\\s]+")

This gives me a list object with the screen names found in the text of each tweet.

The first question is: how do I get a data frame with the following structure?

X1	X2	X3	X4	X5	...	Xn
@sernac	@vtrchile	NA	NA	NA	NA	NA
@username	@playstation	@taylorswitft	@elonmusk	@instagram	NA	NA
@username2	@username5	@selenagomez	@username2	@username3	@FIFA	@xbox
@username4	@ebay	NA	NA	NA	NA	NA

Where the number of columns equals the maximum number of elements in any object of the list.

I tried the following code, but it only returns 4 columns, even though the largest object in the list has 9 elements.

df <- data.frame(matrix(unlist(Names), nrow=length(Names), byrow = T))

After this, I need to perform a left join between this table and a cluster table that I created. The join should first match the first column of the new data frame against the cluster data frame. If there is no match, it should try a second left join using the second column, and so on, until all the columns are exhausted.

This is an example of the data frame I created and the final desired result:

CLUSTER DATA FRAME

screen_name	cluster
@sernac	Gov
@playstation	Videogames
@walmart	Supermarket
@SelenaGomez	Celebrity
@elonmusk	Celebrity
@xbox	Videogames
@ebay	Ecommerce

FINAL RESULT

X1	X2	X3	X4	X5	...	Xn	cluster
@sernac	@vtrchile	NA	NA	NA	NA	NA	Gov
@username	@playstation	@taylorswitft	@elonmusk	@instagram	NA	NA	Videogames
@username2	@username5	@selenagomez	@username2	@username3	@FIFA	@xbox	Celebrity
@username4	@ebay	NA	NA	NA	NA	NA	Ecommerce

I have tried to explain myself as well as I can. English is not my main language, so I can provide more detail in the comments.

Solution

I would approach this differently.

First, if you are trying to download as many tweets as possible, set n = Inf and retryonratelimit = TRUE:

Tweets <-  search_tweets("@sernac", 
                         n = Inf, 
                         include_rts = FALSE, 
                         retryonratelimit = TRUE)

Second, there is no need to extract screen names from the tweet text, as this information is already parsed into the entities column (in rtweet 1.0 and later; version 0.7.0 returns a mentions_screen_name list column instead).

One way to extract mentions is to use lapply. You can then create a data frame with just the useful columns, and convert screen names to lower case for matching.

library(dplyr)

mentions <- lapply(Tweets$entities, function(x) x$user_mentions) %>%
  bind_rows(.id = "tweet_number") %>%
  select(tweet_number, screen_name) %>%
  mutate(screen_name_lc = tolower(screen_name))

head(mentions)

  tweet_number    screen_name screen_name_lc
1            1 mundo_pacifico mundo_pacifico
2            1       OIMChile       oimchile
3            1   subtel_chile   subtel_chile
4            1 ReclamosSubtel reclamossubtel
5            1         SERNAC         sernac
6            2 mundo_pacifico mundo_pacifico
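
To see what this pipeline does without calling the Twitter API, here is a minimal sketch on made-up data. The entities_mock object is a hypothetical stand-in: it assumes each element of Tweets$entities is a list holding a user_mentions data frame with a screen_name column, as the code above implies.

```r
library(dplyr)

# Hypothetical stand-in for Tweets$entities: one list element per tweet,
# each holding a user_mentions data frame (made-up values, not API output)
entities_mock <- list(
  list(user_mentions = data.frame(
    screen_name = c("SERNAC", "VTRChile"),
    stringsAsFactors = FALSE
  )),
  list(user_mentions = data.frame(
    screen_name = c("PlayStation", "elonmusk", "SERNAC"),
    stringsAsFactors = FALSE
  ))
)

mentions_mock <- lapply(entities_mock, function(x) x$user_mentions) %>%
  # stack the per-tweet data frames; .id records which tweet each row came from
  bind_rows(.id = "tweet_number") %>%
  select(tweet_number, screen_name) %>%
  # lower-case copy used only for case-insensitive matching
  mutate(screen_name_lc = tolower(screen_name))

print(mentions_mock)
```

Each mention becomes one row, and tweet_number ties it back to the tweet it came from.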

Next, add a column with the lower-case screen names to your cluster data:

library(stringr)

cluster_df <- cluster_df %>% 
  mutate(screen_name_lc = str_replace(screen_name, "@", "") %>% 
         tolower())

Now we can join the data frames, just on the screen_name_lc column:

mentions_clusters <- mentions %>% 
  left_join(cluster_df, 
            by = "screen_name_lc") %>% 
  select(tweet_number, screen_name = screen_name.x, cluster)

head(mentions_clusters)

  tweet_number    screen_name cluster
1            1 mundo_pacifico    <NA>
2            1       OIMChile    <NA>
3            1   subtel_chile    <NA>
4            1 ReclamosSubtel    <NA>
5            1         SERNAC     Gov
6            2 mundo_pacifico    <NA>
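
The question asks for a single cluster per tweet, taken from the first mention that has a match. One possible way to get that from the joined table is to keep the first non-missing cluster within each tweet. The sketch below uses a small made-up table, mentions_clusters_mock, shaped like mentions_clusters; the names and values are illustrative only.

```r
library(dplyr)

# Made-up long table with the same columns as mentions_clusters
mentions_clusters_mock <- data.frame(
  tweet_number = c("1", "1", "2", "2", "2", "3"),
  screen_name  = c("SERNAC", "VTRChile", "username",
                   "PlayStation", "elonmusk", "username4"),
  cluster      = c("Gov", NA, NA, "Videogames", "Celebrity", NA),
  stringsAsFactors = FALSE
)

tweet_clusters <- mentions_clusters_mock %>%
  group_by(tweet_number) %>%
  # first non-missing cluster in mention order; NA when no mention matched
  summarise(cluster = cluster[!is.na(cluster)][1], .groups = "drop")

print(tweet_clusters)
```

This relies on the rows within each tweet staying in mention order, which is the order bind_rows leaves them in.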

This "long" format is much easier to work with for subsequent analysis than the "wide" format, and can still be grouped by tweet using the tweet_number column.
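
If the wide layout from the question is still required, the long table can be reshaped afterwards. The matrix(unlist(Names), ...) attempt in the question cannot produce it, because matrix() derives the number of columns from the total number of elements divided by the number of rows, so mentions stop lining up with their tweets. A sketch using tidyr::pivot_wider on made-up data follows; mentions_mock and the position column are names introduced here for illustration.

```r
library(dplyr)
library(tidyr)

# Made-up long table: one row per mention
mentions_mock <- data.frame(
  tweet_number = c("1", "1", "2", "2", "2"),
  screen_name  = c("SERNAC", "VTRChile", "username",
                   "PlayStation", "elonmusk"),
  stringsAsFactors = FALSE
)

mentions_wide <- mentions_mock %>%
  group_by(tweet_number) %>%
  # number the mentions within each tweet: X1, X2, ...
  mutate(position = paste0("X", row_number())) %>%
  ungroup() %>%
  # one row per tweet; tweets with fewer mentions are padded with NA
  pivot_wider(names_from = position, values_from = screen_name)

print(mentions_wide)
```

The number of X columns then equals the largest number of mentions in any single tweet.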

Data for cluster_df:

cluster_df <- structure(list(screen_name = c("@sernac", "@playstation", "@walmart", 
"@SelenaGomez", "@elonmusk", "@xbox", "@ebay"), cluster = c("Gov", 
"Videogames", "Supermarket", "Celebrity", "Celebrity", "Videogames", 
"Ecommerce"), screen_name_lc = c("sernac", "playstation", "walmart", 
"selenagomez", "elonmusk", "xbox", "ebay")), class = "data.frame", row.names = c(NA, 
-7L))