Tags: r, database, dplyr, tidyverse

Creating a unique id per username (dplyr) vs. Stata


I have a Reddit dataset where each row represents a single post, along with the poster's username. The number of posts per username varies a lot, depending on how active a given user is. I am trying to create a unique id for each username, so that every post by the same user shares one id. My data are structured as follows:

dput(df[1:5,c(2,3)])

output:

structure(list(date = structure(c(15149, 15150, 15150, 15150, 
15150), class = "Date"), username = c("تتطور", "عاطله فقط", 
"قصه ألم", "بشروني بوظيفة", "الواعده"
)), class = c("grouped_df", "tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-5L), groups = structure(list(username = c("الواعده", 
"بشروني بوظيفة", "تتطور", "عاطله فقط", 
"قصه ألم"), .rows = structure(list(5L, 4L, 1L, 2L, 3L), ptype = integer(0), class = c("vctrs_list_of", 
"vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -5L), .drop = TRUE))

I ran the following code, in which I tried to replicate the code here

The code runs without errors, but it does not give me a unique id per username:

library(dplyr)

# create an ID per observation
df <- df %>%
  group_by(username) %>%
  mutate(id = row_number()) %>%
  relocate(id)

Here is a sample of the result, showing only the id and username columns:

dput(df[1:10,c(1,4)])

output:

structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 3L), 
    username = c("تتطور", "عاطله فقط", "قصه ألم", 
    "بشروني بوظيفة", "الواعده", "ماخليتوآ لي اسم", 
    "مرافئ ساكنه", "معتوقة", "تتطور", "تتطور"
    )), class = c("grouped_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -10L), groups = structure(list(username = c("الواعده", 
"بشروني بوظيفة", "تتطور", "عاطله فقط", 
"قصه ألم", "ماخليتوآ لي اسم", "مرافئ ساكنه", 
"معتوقة"), .rows = structure(list(5L, 4L, c(1L, 9L, 10L
), 2L, 3L, 6L, 7L, 8L), ptype = integer(0), class = c("vctrs_list_of", 
"vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -8L), .drop = TRUE))

In Stata, I would do this as follows:

// create an id variable per username
egen id = group(username)

Solution

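  • That's an incorrect use of group_by for your purpose: row_number() inside group_by(username) numbers the rows within each username (1, 2, 3, ... for that user's posts), so it counts posts per user instead of labelling users. If you want an id like the one your Stata code creates with egen, try this: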

    df$id <- as.integer(factor(df$username))
    

    This produces the same ids as Stata's

    egen id = group(username)
    

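    Just FYI, I also tried dplyr::consecutive_id():

    df %>% mutate(
      id_dplyr = dplyr::consecutive_id(username)
    )

    but I was unable to reproduce the Stata results with your example. consecutive_id() starts a new id every time the value changes from the previous row, so a username that reappears later in the data gets a different id each time.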
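
To make the difference between these approaches concrete, here is a minimal sketch on made-up toy data (the `toy` tibble and the column names `id_factor`, `id_group`, `id_first` and `id_run` are hypothetical, not from your dataset). It compares the factor approach with a dplyr-only variant using cur_group_id(), an id in order of first appearance, and consecutive_id(). I am assuming dplyr 1.1.0 or later, since consecutive_id() is not available in older versions. With non-ASCII usernames such as yours, the ordering of the ids may depend on the locale, so treat the exact numbers as illustrative only.

```r
library(dplyr)

# Hypothetical toy data: "b" and "a" each appear twice, non-adjacently
toy <- tibble(username = c("b", "a", "c", "b", "a"))

toy <- toy %>%
  mutate(
    # one id per distinct username, numbered in sorted order (like egen group())
    id_factor = as.integer(factor(username)),
    # one id per distinct username, numbered in order of first appearance
    id_first  = match(username, unique(username)),
    # new id whenever the value differs from the previous row
    id_run    = consecutive_id(username)
  ) %>%
  group_by(username) %>%
  # dplyr-only equivalent of the factor approach: the index of the current group
  mutate(id_group = cur_group_id()) %>%
  ungroup()

print(toy)
# id_factor and id_group: 2 1 3 2 1
# id_first:               1 2 3 1 2
# id_run:                 1 2 3 4 5
```

Note the ungroup() at the end: your df is still grouped by username, and leaving it grouped means later mutate() calls are evaluated within each group.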