How to create a user-by-movie matrix of movie ratings in R?


Suppose I am using this data set of movie ratings: http://www.grouplens.org/node/73

It contains ratings in a file formatted as userID::movieID::rating::timestamp

Given this, I want to construct a feature matrix in R, where each row corresponds to a user and each column holds the rating that the user gave to one movie (if any).

For example, if the data file contains

1::1::1::10
2::2::2::11
1::2::3::12
2::1::5::13
3::3::4::14

Then the output matrix would look like:

UserID, Movie1, Movie2, Movie3
1, 1, 3, NA
2, 5, 2, NA
3, NA, NA, 4

Is there a built-in way to achieve this in R? I wrote a simple Python script to do the same thing, but I bet there are more efficient ways to accomplish this.


Solution

  • You can use the dcast function, in the reshape2 package, but the resulting data.frame may be huge (and sparse).

    # u1.base is tab-separated and has no header row
    d <- read.delim(
      "u1.base",
      header = FALSE,
      col.names = c("user", "film", "rating", "timestamp")
    )
    library(reshape2)
    d <- dcast( d, user ~ film, value.var = "rating" )
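
    Since the wide table is mostly empty, one possible alternative is to keep the ratings in a sparse matrix. The following is only a sketch, assuming the Matrix package is available and that user and film IDs are positive integers; the toy data and the "user"/"film" dimnames are made up for illustration.

    ```r
    library(Matrix)  # assumption: the Matrix package is installed

    # Toy ratings in long format (the five rows from the question).
    d <- data.frame(
      user   = c(1, 2, 1, 2, 3),
      film   = c(1, 2, 2, 1, 3),
      rating = c(1, 2, 3, 5, 4)
    )

    # One row per user, one column per film; only observed ratings are stored.
    m <- sparseMatrix(
      i = d$user,
      j = d$film,
      x = d$rating,
      dimnames = list(
        paste0("user", seq_len(max(d$user))),  # hypothetical row labels
        paste0("film", seq_len(max(d$film)))   # hypothetical column labels
      )
    )
    ```

    Note that a sparse matrix reports unstored entries as 0 rather than NA, so this only works when 0 is not a valid rating (here ratings start at 1), and repeated (user, film) pairs would be summed.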
    

    If your fields are separated by double colons, you cannot pass "::" as the sep argument of read.delim, which must be a single character. If you already preprocess the file outside R, it is easier to convert the separator there (e.g., in Perl, it would just be s/::/\t/g), but you can also do it in R: read the file as a single column, split each string on "::", and bind the pieces into a matrix.

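    # header = FALSE so the first rating is not consumed as a header row
    d <- read.delim( "a", header = FALSE, stringsAsFactors = FALSE )
    d <- as.character( d[,1] )   # Vector of strings, one per line
    d <- strsplit( d, "::" )     # List of character vectors (one per line)
    d <- lapply( d, as.numeric ) # List of vectors of numbers
    d <- do.call( rbind, d )     # Matrix, one row per rating
    d <- as.data.frame( d )
    # Same column names as above, so that dcast( d, user ~ film, ... ) still works
    colnames( d ) <- c( "user", "film", "rating", "timestamp" )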
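
    To check the whole pipeline on the five sample lines from the question, here is a minimal sketch in base R. It keeps the lines in a character vector instead of reading a file, and it uses tapply as a stand-in for dcast so that no extra package is needed; the variable names `lines`, `fields` and `ratings` are illustrative only.

    ```r
    # The five sample lines from the question, used in place of a file on disk.
    lines <- c("1::1::1::10", "2::2::2::11", "1::2::3::12",
               "2::1::5::13", "3::3::4::14")

    # Split each line on the literal "::", convert to numbers, stack into rows.
    fields <- strsplit(lines, "::", fixed = TRUE)
    d <- as.data.frame(do.call(rbind, lapply(fields, as.numeric)))
    colnames(d) <- c("user", "film", "rating", "timestamp")

    # Base-R stand-in for dcast(): users in rows, films in columns,
    # NA wherever a user has not rated a film.
    ratings <- tapply(d$rating, list(user = d$user, film = d$film), FUN = sum)
    print(ratings)
    ```

    With a real file, `lines` could instead come from readLines("a"). If a user rated the same film more than once, FUN = sum would add the ratings together, so a different aggregate may be preferable in that case.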