Search code examples
rcsvdata-manipulationtraminer

Pivoting a CSV file using R


I have a file that looks like this:

                 type          created_at repository_name
1         IssuesEvent 2012-03-11 06:48:31       bootstrap
2         IssuesEvent 2012-03-11 06:48:31       bootstrap
3         IssuesEvent 2012-03-11 06:48:31       bootstrap
4         IssuesEvent 2012-03-11 06:52:50       bootstrap
5         IssuesEvent 2012-03-11 06:52:50       bootstrap
6         IssuesEvent 2012-03-11 06:52:50       bootstrap
7   IssueCommentEvent 2012-03-11 07:03:57       bootstrap
8   IssueCommentEvent 2012-03-11 07:03:57       bootstrap
9   IssueCommentEvent 2012-03-11 07:03:57       bootstrap
10        IssuesEvent 2012-03-11 07:03:58       bootstrap
11        IssuesEvent 2012-03-11 07:03:58       bootstrap
12        IssuesEvent 2012-03-11 07:03:58       bootstrap
13         WatchEvent 2012-03-11 07:15:44       bootstrap
14         WatchEvent 2012-03-11 07:15:44       bootstrap
15         WatchEvent 2012-03-11 07:15:44       bootstrap
16         WatchEvent 2012-03-11 07:18:45        hogan.js
17         WatchEvent 2012-03-11 07:18:45        hogan.js
18         WatchEvent 2012-03-11 07:18:45        hogan.js

The dataset that I'm working with can be accessed on https://github.com/aronlindberg/VOSS-Sequencing-Toolkit/blob/master/twitter_exploratory_analysis/twitter_events_mini.csv.

I want to create a table that has a column for each entry in the "repository_name" column (e.g. bootstrap, hogan.js). In that column I need to have the data from the "type" column that corresponds to that entry (i.e. only rows form the current "type" column that also has the value "bootstrap" in the current "repository_name" column should fall under the new "bootstrap" column). Hence:

  • Time stamps is just for ordering and do not need to by synchronized across the row (in fact they can be deleted, as the data is already sorted according to timestamps)
  • Even if "IssuesEvent" is repeated 10x I need to retain all of these, since I will be doing sequence analysis using the R package TraMineR
  • Columns can be of unequal length
  • There is no relationship between the columns for different repos ("repository_name")

In other words, I would want a table that looks something like this:

     bootstrap            hogan.js
1    IssuesEvent          PushEvent
2    IssuesEvent          IssuesEvent
3    OssueCommentEvent    WatchEvent

How can I accomplish this in R?

Some of my failed attempts using the reshape package can be found on https://github.com/aronlindberg/VOSS-Sequencing-Toolkit/blob/master/twitter_exploratory_analysis/reshaping_bigqueries.R.


Solution

  • Your sample data:

    data <- structure(list(type = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 1L, 
    1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("IssueCommentEvent", 
    "IssuesEvent", "WatchEvent"), class = "factor"), created_at = structure(c(1L, 
    1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L, 5L, 5L, 5L, 6L, 6L, 
    6L), .Label = c("2012-03-11 06:48:31", "2012-03-11 06:52:50", 
    "2012-03-11 07:03:57", "2012-03-11 07:03:58", "2012-03-11 07:15:44", 
    "2012-03-11 07:18:45"), class = "factor"), repository_name = structure(c(1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 
    2L), .Label = c("bootstrap", "hogan.js"), class = "factor")), .Names = c("type", 
    "created_at", "repository_name"), class = "data.frame", row.names = c(NA, 
    -18L))
    

    I gather from your expected output that you want only one type when it shows up multiple times for the same created_at value, in other words you want to remove duplicates:

    data <- unique(data)
    

    Then, to extract all type entries per repository_name in the order they appear, you can simply use:

    data.split <- split(data$type, data$repository_name)
    data.split
    # $bootstrap
    # [1] IssuesEvent       IssuesEvent       IssueCommentEvent
    # [4] IssuesEvent       WatchEvent       
    # Levels: IssueCommentEvent IssuesEvent WatchEvent
    # 
    # $hogan.js
    # [1] WatchEvent
    # Levels: IssueCommentEvent IssuesEvent WatchEvent
    

    It returns a list which is the R data structure of choice for a collection of vectors with different lengths.

    Edit: Now that you have provided an example of your output data, it has become more apparent that your expected output is indeed a data.frame. You can convert the list above into a data.frame padded with NAs using the following function:

    list.to.df <- function(arg.list) {
       max.len  <- max(sapply(arg.list, length))
       arg.list <- lapply(arg.list, `length<-`, max.len)
       as.data.frame(arg.list)
    }
    
    df.out <- list.to.df(data.split)
    df.out
    #           bootstrap   hogan.js
    # 1       IssuesEvent WatchEvent
    # 2       IssuesEvent       <NA>
    # 3 IssueCommentEvent       <NA>
    # 4       IssuesEvent       <NA>
    # 5        WatchEvent       <NA>
    

    You can then save that to a file using

    write.csv(df.out, file = "out.csv", quote = FALSE, na = "", row.names = FALSE)
    

    to get the exact same output format as the one you published on github.