Suppose I am using this data set of movie ratings: http://www.grouplens.org/node/73
It contains ratings in a file formatted as userID::movieID::rating::timestamp
Given this, I want to construct a feature matrix in R, where each row corresponds to a user and each column holds the rating that the user gave to one movie (if any).
For example, if the data file contains
1::1::1::10
2::2::2::11
1::2::3::12
2::1::5::13
3::3::4::14
Then the output matrix would look like:
UserID, Movie1, Movie2, Movie3
1, 1, 3, NA
2, 5, 2, NA
3, NA, NA, 4
So is there some built-in way to achieve this in R? I wrote a simple Python script to do the same thing, but I bet there are more efficient ways to accomplish this.
You can use the dcast function in the reshape2 package, but the resulting data.frame may be huge (and sparse).
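# Read the tab-separated ratings; header = FALSE because the file has no
# header line (read.delim would otherwise use the first rating as column names)
d <- read.delim(
  "u1.base",
  header = FALSE,
  col.names = c("user", "film", "rating", "timestamp")
)
library(reshape2)
# One row per user, one column per film; unrated films become NA
d <- dcast( d, user ~ film, value.var = "rating" )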
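Since the wide data.frame may be huge and mostly empty, one alternative is to store the ratings in a sparse matrix instead. The following is only a minimal sketch, assuming the Matrix package is available and using the toy ratings from the question; note that in a sparse matrix a missing rating shows up as 0 rather than NA.

```r
library(Matrix)

# Toy ratings taken from the example in the question
d <- data.frame(
  user   = c(1, 2, 1, 2, 3),
  film   = c(1, 2, 2, 1, 3),
  rating = c(1, 2, 3, 5, 4)
)

# Map the (possibly non-consecutive) IDs to row and column indices
users <- sort(unique(d$user))
films <- sort(unique(d$film))

# Only the observed ratings are stored; all other cells are implicit zeros
m <- sparseMatrix(
  i = match(d$user, users),
  j = match(d$film, films),
  x = d$rating,
  dims = c(length(users), length(films)),
  dimnames = list(as.character(users), as.character(films))
)
```

Here m["1", "2"] is the rating that user 1 gave to film 2. Whether 0 is an acceptable stand-in for "not rated" depends on what you do with the matrix afterwards.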
If your fields are separated by double colons, you cannot use the sep argument of read.delim, which has to be a single character.
If you already do some preprocessing outside R, it is easier to do it there (e.g., in Perl, it would just be s/::/\t/g), but you can also do it in R: read the file as a single column, split the strings, and concatenate the result.
# header = FALSE so that the first rating is not consumed as a header line
d <- read.delim( "a", header = FALSE, stringsAsFactors = FALSE )
d <- as.character( d[,1] )    # Vector of strings, one per line
d <- strsplit( d, "::" )      # List of character vectors (one per line)
d <- lapply( d, as.numeric )  # List of numeric vectors
d <- do.call( rbind, d )      # Matrix with one row per rating
d <- as.data.frame( d )
colnames( d ) <- c( "user", "movie", "rating", "timestamp" )
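To check the splitting step against the small example from the question, here is a self-contained sketch. It keeps the five sample lines in a character vector (rather than reading a file) and, as an assumption made only for this illustration, uses base R's tapply instead of dcast to build the user-by-movie table, so that no extra package is needed.

```r
# The five sample lines from the question, as they would appear in the file
lines <- c("1::1::1::10", "2::2::2::11", "1::2::3::12",
           "2::1::5::13", "3::3::4::14")

# Split each line on "::", convert to numbers, and stack into a data.frame
parts <- strsplit(lines, "::", fixed = TRUE)
d <- as.data.frame(do.call(rbind, lapply(parts, as.numeric)))
colnames(d) <- c("user", "movie", "rating", "timestamp")

# One row per user, one column per movie; combinations that do not occur
# in the data are left as NA
wide <- tapply(d$rating, list(user = d$user, movie = d$movie), FUN = mean)
```

FUN = mean only matters if a user rated the same movie more than once; with one rating per user and movie it returns the rating unchanged.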