Copying data from text into new columns in R


I have compiled a dataset of tweets using the Twitter API.

The dataset basically looks as follows:

Data <- data.frame(
  X = c(1,2),
  text = c("Hello @User1 #hashtag1, hello @User2 and @User3, #hashtag2", "Hello @User2 #hashtag3, hello @User1 and @User3, #hashtag4"),
  screenname = c("author1", "author2")
) 

Now I want to create a data.frame for social network analysis. I want to show how each of the screennames (in the case of this example "author1" etc.) is linked to users ("@User1" etc.) and hashtags ("#hashtag1", etc.).

To do so, I need to extract/copy users and hashtags from the "text" column and write them into new columns. The data.frame should look like this:

Data <- data.frame(
  X = c(1,2),
  text = c("Hello @User1 #hashtag1, hello @User2 and @User3, #hashtag2", "Hello @User2 #hashtag3, hello @User1 and @User3, #hashtag4"),
  screenname = c("author1", "author2"),
  U1 = c("@User1", "@User2"),
  U2 = c("@User2", "@User1"),
  U3 = c("@User3", "@User3"),
  U4 = c("",""),
  U5 = c("",""),
  H1 = c("#hashtag1", "#hashtag3"),
  H2 = c("#hashtag2", "#hashtag4"),
  H3 = c("",""),
  H4 = c("",""),
  H5 = c("","")
)

How can I extract/copy this information from the "text" column and write it into new columns?


Solution

  • Here's a simple approach using the stringi package. It creates as many columns as the largest number of users (or hashtags) found in any single tweet, so it works for any number of mentioned users or hashtags. It should also be efficient, because most of the work is vectorized.

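    library(stringi)

    # Extract all @mentions per tweet (a list with one character vector per row)
    Users <- stri_extract_all(Data$text, regex = "@[A-Za-z0-9]+")
    # One column per mention; shorter rows are padded with NA (pass fill = "" for empty strings)
    Data[paste0("U", seq_len(max(lengths(Users))))] <- stri_list2matrix(Users, byrow = TRUE)

    # Same procedure for #hashtags
    Hash <- stri_extract_all(Data$text, regex = "#[A-Za-z0-9]+")
    Data[paste0("H", seq_len(max(lengths(Hash))))] <- stri_list2matrix(Hash, byrow = TRUE)
    Data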
    #   X                                                       text screenname     U1     U2     U3        H1        H2
    # 1 1 Hello @User1 #hashtag1, hello @User2 and @User3, #hashtag2    author1 @User1 @User2 @User3 #hashtag1 #hashtag2
    # 2 2 Hello @User2 #hashtag3, hello @User1 and @User3, #hashtag4    author2 @User2 @User1 @User3 #hashtag3 #hashtag4
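
The output above has only as many U and H columns as the data needs, whereas the question asks for a fixed U1 to U5 and H1 to H5 with empty strings in the unused slots. A minimal base R sketch of that variant follows. The helper name `extract_padded` and its arguments are made up for illustration, and it assumes that tweets with more than `n` matches should simply be truncated.

```r
# Hypothetical helper (not part of the original answer): extract every match
# of `pattern` from each string and pad each row to exactly `n` columns with "".
extract_padded <- function(text, pattern, prefix, n = 5) {
  text <- as.character(text)  # guard against factor columns
  matches <- regmatches(text, gregexpr(pattern, text))
  padded <- lapply(matches, function(m) {
    m <- head(m, n)                 # assumption: drop matches beyond n
    c(m, rep("", n - length(m)))    # fill the remaining slots with ""
  })
  out <- do.call(rbind, padded)
  colnames(out) <- paste0(prefix, seq_len(n))
  as.data.frame(out, stringsAsFactors = FALSE)
}

Data <- data.frame(
  X = c(1, 2),
  text = c("Hello @User1 #hashtag1, hello @User2 and @User3, #hashtag2",
           "Hello @User2 #hashtag3, hello @User1 and @User3, #hashtag4"),
  screenname = c("author1", "author2"),
  stringsAsFactors = FALSE
)

# Bind the original columns to the padded user and hashtag columns
Result <- cbind(
  Data,
  extract_padded(Data$text, "@[A-Za-z0-9]+", "U"),
  extract_padded(Data$text, "#[A-Za-z0-9]+", "H")
)
```

Here `Result` holds the three original columns followed by U1 to U5 and H1 to H5. Note that the pattern `[A-Za-z0-9]+` is taken unchanged from the answer above and stops at the first character outside that set, such as an underscore.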