I have a reddit dataset where each row represents a single reddit post, along with the username info. However, given that it's reddit data, the number of posts per username varies a lot (i.e. depending on how active a given username is on reddit). I am trying to create a unique id for each username and my data are structured as follows:
dput(df[1:5,c(2,3)])
output:
structure(list(date = structure(c(15149, 15150, 15150, 15150,
15150), class = "Date"), username = c("تتطور", "عاطله فقط",
"قصه ألم", "بشروني بوظيفة", "الواعده"
)), class = c("grouped_df", "tbl_df", "tbl", "data.frame"), row.names = c(NA,
-5L), groups = structure(list(username = c("الواعده",
"بشروني بوظيفة", "تتطور", "عاطله فقط",
"قصه ألم"), .rows = structure(list(5L, 4L, 1L, 2L, 3L), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -5L), .drop = TRUE))
I ran the following code where I tried replicate the code here
The code works w/out errors, but I am unable to create a unique id by username. #create an ID per observation
df <- df %>%
group_by(username) %>%
mutate(id = row_number())%>%
relocate(id)
dput(df[1:10,c(1,4)])
output:
structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 3L),
username = c("تتطور", "عاطله فقط", "قصه ألم",
"بشروني بوظيفة", "الواعده", "ماخليتوآ لي اسم",
"مرافئ ساكنه", "معتوقة", "تتطور", "تتطور"
)), class = c("grouped_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -10L), groups = structure(list(username = c("الواعده",
"بشروني بوظيفة", "تتطور", "عاطله فقط",
"قصه ألم", "ماخليتوآ لي اسم", "مرافئ ساكنه",
"معتوقة"), .rows = structure(list(5L, 4L, c(1L, 9L, 10L
), 2L, 3L, 6L, 7L, 8L), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -8L), .drop = TRUE))
In Stata, I would do this as follows:
// create an id variable per username
egen id = group(username)
That's an incorrect use of group_by
for your purpose. If you want to get an id just like your Stata code with egen
, you may want to try this:
df$id = as.integer(factor(df$username))
This produced the same id
as Stata
egen id = group(username)
Just FYI, I also tried dplyr::consecutive_id()
:
df %>% mutate(
id_dplyr = dplyr::consecutive_id(username)
)
but unable to reproduce Stata results with your example.