I have a file that looks like this:
type created_at repository_name
1 IssuesEvent 2012-03-11 06:48:31 bootstrap
2 IssuesEvent 2012-03-11 06:48:31 bootstrap
3 IssuesEvent 2012-03-11 06:48:31 bootstrap
4 IssuesEvent 2012-03-11 06:52:50 bootstrap
5 IssuesEvent 2012-03-11 06:52:50 bootstrap
6 IssuesEvent 2012-03-11 06:52:50 bootstrap
7 IssueCommentEvent 2012-03-11 07:03:57 bootstrap
8 IssueCommentEvent 2012-03-11 07:03:57 bootstrap
9 IssueCommentEvent 2012-03-11 07:03:57 bootstrap
10 IssuesEvent 2012-03-11 07:03:58 bootstrap
11 IssuesEvent 2012-03-11 07:03:58 bootstrap
12 IssuesEvent 2012-03-11 07:03:58 bootstrap
13 WatchEvent 2012-03-11 07:15:44 bootstrap
14 WatchEvent 2012-03-11 07:15:44 bootstrap
15 WatchEvent 2012-03-11 07:15:44 bootstrap
16 WatchEvent 2012-03-11 07:18:45 hogan.js
17 WatchEvent 2012-03-11 07:18:45 hogan.js
18 WatchEvent 2012-03-11 07:18:45 hogan.js
The dataset that I'm working with can be accessed on https://github.com/aronlindberg/VOSS-Sequencing-Toolkit/blob/master/twitter_exploratory_analysis/twitter_events_mini.csv.
I want to create a table that has a column for each entry in the "repository_name" column (e.g. bootstrap, hogan.js). In that column I need to have the data from the "type" column that corresponds to that entry (i.e. only rows form the current "type" column that also has the value "bootstrap" in the current "repository_name" column should fall under the new "bootstrap" column). Hence:
In other words, I would want a table that looks something like this:
bootstrap hogan.js
1 IssuesEvent PushEvent
2 IssuesEvent IssuesEvent
3 OssueCommentEvent WatchEvent
How can I accomplish this in R?
Some of my failed attempts using the reshape package can be found on https://github.com/aronlindberg/VOSS-Sequencing-Toolkit/blob/master/twitter_exploratory_analysis/reshaping_bigqueries.R.
Your sample data:
data <- structure(list(type = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 1L,
1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("IssueCommentEvent",
"IssuesEvent", "WatchEvent"), class = "factor"), created_at = structure(c(1L,
1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L, 5L, 5L, 5L, 6L, 6L,
6L), .Label = c("2012-03-11 06:48:31", "2012-03-11 06:52:50",
"2012-03-11 07:03:57", "2012-03-11 07:03:58", "2012-03-11 07:15:44",
"2012-03-11 07:18:45"), class = "factor"), repository_name = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L), .Label = c("bootstrap", "hogan.js"), class = "factor")), .Names = c("type",
"created_at", "repository_name"), class = "data.frame", row.names = c(NA,
-18L))
I gather from your expected output that you want only one type
when it shows up multiple times for the same created_at
value, in other words you want to remove duplicates:
data <- unique(data)
Then, to extract all type
entries per repository_name
in the order they appear, you can simply use:
data.split <- split(data$type, data$repository_name)
data.split
# $bootstrap
# [1] IssuesEvent IssuesEvent IssueCommentEvent
# [4] IssuesEvent WatchEvent
# Levels: IssueCommentEvent IssuesEvent WatchEvent
#
# $hogan.js
# [1] WatchEvent
# Levels: IssueCommentEvent IssuesEvent WatchEvent
It returns a list which is the R data structure of choice for a collection of vectors with different lengths.
Edit: Now that you have provided an example of your output data, it has become more apparent that your expected output is indeed a data.frame. You can convert the list above into a data.frame padded with NA
s using the following function:
list.to.df <- function(arg.list) {
max.len <- max(sapply(arg.list, length))
arg.list <- lapply(arg.list, `length<-`, max.len)
as.data.frame(arg.list)
}
df.out <- list.to.df(data.split)
df.out
# bootstrap hogan.js
# 1 IssuesEvent WatchEvent
# 2 IssuesEvent <NA>
# 3 IssueCommentEvent <NA>
# 4 IssuesEvent <NA>
# 5 WatchEvent <NA>
You can then save that to a file using
write.csv(df.out, file = "out.csv", quote = FALSE, na = "", row.names = FALSE)
to get the exact same output format as the one you published on github.