How to use read_table or fread in this particular case?


As you know, read.table in R is a very useful but slow function, particularly when reading large datasets. To work around this, there are faster alternatives such as read_table and fread from the readr and data.table packages. Unfortunately, their arguments differ from those of read.table, which made it difficult for me to replicate this example:

download.file("https://datasets.imdbws.com/title.basics.tsv.gz", "mov_title")
download.file("https://datasets.imdbws.com/title.ratings.tsv.gz", "mov_rating")

title <- read.table("mov_title", sep="\t", header=TRUE,
    fill=TRUE, na.strings="\\N", quote="")

rating <- read.table("mov_rating", sep="\t", header=TRUE,
    fill=TRUE, na.strings="\\N", quote="")

Basically, I want to use fread or read_table (or both, if possible) to create my "title" and "rating" data frames. Any advice or reference would be much appreciated.


Solution

  • This seems to work just fine: data.table::fread() can read .gz files directly (this requires the R.utils package to be installed).

    Set \t (tab) as the separator.
    Since some movie titles contain quotes, disable quoting with quote = "" (or leave the default and just accept the warnings).

    library( data.table )
    title  <- fread( "https://datasets.imdbws.com/title.basics.tsv.gz", 
                     sep = "\t", quote = "" )
    rating <- fread( "https://datasets.imdbws.com/title.ratings.tsv.gz", 
                     sep = "\t", quote = "" )
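
The question also passes na.strings="\\N" to read.table and asks about readr, so here is a minimal sketch of how both could be carried over. It runs on a tiny made-up sample file (the rows and the names sample_file, title_dt and title_tbl are illustrative assumptions, not part of the IMDb data) so it can be tried without downloading anything. It uses readr's read_tsv() rather than read_table(), on the assumption that the tab-delimited reader is the closer match, since read_table() is meant for whitespace-separated columns.

```r
library(data.table)
library(readr)

# Tiny stand-in for the IMDb files (hypothetical rows): tab-separated,
# "\N" marking missing values, and a stray double quote inside a title.
sample_file <- tempfile(fileext = ".tsv")
writeLines(c(
  "tconst\tprimaryTitle\tstartYear",
  "tt0000001\tCarmencita\t1894",
  "tt0000002\tThe \"Quoted Title\t\\N"
), sample_file)

# data.table: na.strings plays the same role as in read.table
title_dt <- fread(sample_file, sep = "\t", quote = "", na.strings = "\\N")

# readr: the argument for missing-value markers is called `na`
title_tbl <- read_tsv(sample_file, quote = "", na = "\\N")
```

For the real data, sample_file would be replaced by the downloaded file or the URL. Both readers should then return the "\N" entries as NA instead of as text.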