I have compiled a dataset of tweets using the Twitter API.
The dataset basically looks as follows:
Data <- data.frame(
X = c(1,2),
text = c("Hello @User1 #hashtag1, hello @User2 and @User3, #hashtag2", "Hello @User2 #hashtag3, hello @User1 and @User3, #hashtag4"),
screenname = c("author1", "author2")
)
Now I want to create a data.frame for social network analysis. I want to show how each of the screennames (in this example "author1", etc.) is linked to users ("@User1", etc.) and hashtags ("#hashtag1", etc.).
To do so, I need to extract/copy users and hashtags from the "text" column and write them to new columns. The data.frame should look like this:
Data <- data.frame(
X = c(1,2),
text = c("Hello @User1 #hashtag1, hello @User2 and @User3, #hashtag2", "Hello @User2 #hashtag3, hello @User1 and @User3, #hashtag4"),
screenname = c("author1", "author2"),
U1 = c("@User1", "@User2"),
U2 = c("@User2", "@User1"),
U3 = c("@User3", "@User3"),
U4 = c("",""),
U5 = c("",""),
H1 = c("#hashtag1", "#hashtag3"),
H2 = c("#hashtag2", "#hashtag4"),
H3 = c("",""),
H4 = c("",""),
H5 = c("","")
)
How can I extract/copy this information from the "text" column and write it into new columns?
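Here's my simple attempt using the stringi package. This method creates as many columns as the largest number of users (or hashtags) found in any single tweet, so it works for any number of mentioned users or hashtags. It should also be efficient, because the solution is mostly vectorized. Note that stri_list2matrix pads shorter rows with NA by default; pass fill = "" if you want empty strings as in your desired output.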
library(stringi)
Users <- stri_extract_all(Data$text, regex = "@[A-Za-z0-9]+")
Data[paste0("U", seq_len(max(sapply(Users, length))))] <- stri_list2matrix(Users, byrow = TRUE)
Hash <- stri_extract_all(Data$text, regex = "#[A-Za-z0-9]+")
Data[paste0("H", seq_len(max(sapply(Hash, length))))] <- stri_list2matrix(Hash, byrow = TRUE)
Data
# X text screenname U1 U2 U3 H1 H2
# 1 1 Hello @User1 #hashtag1, hello @User2 and @User3, #hashtag2 author1 @User1 @User2 @User3 #hashtag1 #hashtag2
# 2 2 Hello @User2 #hashtag3, hello @User1 and @User3, #hashtag4 author2 @User2 @User1 @User3 #hashtag3 #hashtag4
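Since the stated goal is social network analysis, a long "edge list" (one row per author-to-target link) may be easier to feed into network tools than the wide U1..Un / H1..Hn layout. The following is only a minimal base R sketch under that assumption, not part of the stringi approach above; the column names from, to and type are made up for illustration, and the regex is the same simple one used above, so it will not match handles containing underscores.

```r
# Same toy data as in the question (only the columns needed here).
Data <- data.frame(
  text = c("Hello @User1 #hashtag1, hello @User2 and @User3, #hashtag2",
           "Hello @User2 #hashtag3, hello @User1 and @User3, #hashtag4"),
  screenname = c("author1", "author2"),
  stringsAsFactors = FALSE
)

# Extract every @mention and #hashtag per tweet: a list with one
# character vector per row (empty if a tweet contains neither).
tokens <- regmatches(Data$text, gregexpr("[@#][A-Za-z0-9]+", Data$text))

# Repeat each author once per extracted token, giving one row per link.
# Tweets without any token contribute zero rows.
edges <- data.frame(
  from = rep(Data$screenname, lengths(tokens)),
  to   = unlist(tokens),
  stringsAsFactors = FALSE
)

# Label each link by its leading character (hypothetical column name).
edges$type <- ifelse(startsWith(edges$to, "@"), "mention", "hashtag")

edges
```

Each row of edges is then a single author-to-user or author-to-hashtag link, and the type column lets you split the two kinds of links if you want separate networks.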