Search code examples
rdplyrdata.tablepurrrreadr

R: import and merge only specific columns


I have a list of CSV URLs from the net and have merged them to a vector. Now, I want to read this list with read_csv.

Example:

files <- c("csv_link1.csv",
             "csv_link2.csv",
             "csv_link3.csv",
              and so on....) 
data <- map_dfr(files, read_csv)

This is no problem. The Problem is that in the CSV files there are columns that are filled with different values. So, for example, in CSV1 there is column "V1", which is filled with double and in CSV the same column is "V1", filled with characters. Merging the CSVs does not work because they are different data types.

In my case, I think there are two possibilities to solve this.

  1. I only import certain columns, so I say read_csv only reads column (V2 and V3) but not V1

Or

  1. I merge the columns to the same data type with col_types

I have tried both, but failed because of the correct syntax.

I tried something like

data <- map_dfr(files, read_csv(cols_only(the col names)))

But, this don't work.

How can I import and merge only specific columns?

Concrete Example in my case:

library(data.table)
library(readr)
library(purrr)


files <- c("https://www.football-data.co.uk/mmz4281/1920/EC.csv",
           "https://www.football-data.co.uk/mmz4281/1819/EC.csv",
           "https://www.football-data.co.uk/mmz4281/1718/EC.csv",
           "https://www.football-data.co.uk/mmz4281/1617/EC.csv",
           "https://www.football-data.co.uk/mmz4281/1516/EC.csv",
           "https://www.football-data.co.uk/mmz4281/1415/EC.csv",
           "https://www.football-data.co.uk/mmz4281/1314/EC.csv",
           "https://www.football-data.co.uk/mmz4281/1213/EC.csv",
           "https://www.football-data.co.uk/mmz4281/1112/EC.csv",
           "https://www.football-data.co.uk/mmz4281/1011/EC.csv")


data <- map_dfr(files, read_csv)

Error: Can't combine `BbAH` <character> and `BbAH` <double>.

So i the column BbAH has different data types. But i dont need this column. It would be cool if i can chose the columns which would be merged befor the merge run into an error because of this different data type problem.


Solution

  • Since you only needs those seven variables, you can read in those specific variables using fread to avoid the issue with the BbAH variable.

    library(data.table)
    library(dplyr)
    library(purrr)
    
    files <- c("https://www.football-data.co.uk/mmz4281/1920/EC.csv",
               "https://www.football-data.co.uk/mmz4281/1819/EC.csv",
               "https://www.football-data.co.uk/mmz4281/1718/EC.csv",
               "https://www.football-data.co.uk/mmz4281/1617/EC.csv",
               "https://www.football-data.co.uk/mmz4281/1516/EC.csv",
               "https://www.football-data.co.uk/mmz4281/1415/EC.csv",
               "https://www.football-data.co.uk/mmz4281/1314/EC.csv",
               "https://www.football-data.co.uk/mmz4281/1213/EC.csv",
               "https://www.football-data.co.uk/mmz4281/1112/EC.csv",
               "https://www.football-data.co.uk/mmz4281/1011/EC.csv")
    
    # Identify columns you need
    myColumns = c("Date","Time","HomeTeam","AwayTeam","FTHG","FTAG","FTR")
    
    # Modified function found in https://stackoverflow.com/a/51348578/8535855
    # takes a filename and a vector of columns as input
    fread_allfiles <- function(file, columns){
      x <- fread(file, select = columns) %>% 
        select(everything())   # 
      return(x)
    }
    
    df_all <- files %>% 
      map_df(~ fread_allfiles(.,myColumns))
    
    head(df_all)
    

    which produces the following format:

             Date  Time     HomeTeam       AwayTeam FTHG FTAG FTR
    1: 03/08/2019 12:30    Stockport     Maidenhead    0    1   A
    2: 03/08/2019 15:00    Aldershot          Fylde    1    2   A
    3: 03/08/2019 15:00       Barnet         Yeovil    1    0   H
    4: 03/08/2019 15:00 Chesterfield Dover Athletic    1    2   A
    5: 03/08/2019 15:00      Chorley        Bromley    0    0   D
    6: 03/08/2019 15:00  Dag and Red         Woking    0    2   A
    

    You can then reformat the Date and Time columns if needed. It looks like on the first file has any values for Time? So the rest are filled in as NA

    > str(df_all)
    Classes ‘data.table’ and 'data.frame':  5429 obs. of  7 variables:
     $ Date    : chr  "03/08/2019" "03/08/2019" "03/08/2019" "03/08/2019" ...
     $ Time    : chr  "12:30" "15:00" "15:00" "15:00" ...
     $ HomeTeam: chr  "Stockport" "Aldershot" "Barnet" "Chesterfield" ...
     $ AwayTeam: chr  "Maidenhead" "Fylde" "Yeovil" "Dover Athletic" ...
     $ FTHG    : int  0 1 1 1 0 0 1 1 2 1 ...
     $ FTAG    : int  1 2 0 2 0 2 0 4 2 3 ...
     $ FTR     : chr  "A" "A" "H" "A" ...
     - attr(*, ".internal.selfref")=<externalptr>