I have a list of CSV URLs from the net and have merged them to a vector.
Now, I want to read this list with read_csv
.
Example:
files <- c("csv_link1.csv",
"csv_link2.csv",
"csv_link3.csv",
and so on....)
data <- map_dfr(files, read_csv)
This is no problem. The Problem is that in the CSV files there are columns that are filled with different values. So, for example, in CSV1 there is column "V1", which is filled with double and in CSV the same column is "V1", filled with characters. Merging the CSVs does not work because they are different data types.
In my case, I think there are two possibilities to solve this.
read_csv
only reads column (V2 and V3) but not V1Or
col_types
I have tried both, but failed because of the correct syntax.
I tried something like
data <- map_dfr(files, read_csv(cols_only(the col names)))
But, this don't work.
How can I import and merge only specific columns?
Concrete Example in my case:
library(data.table)
library(readr)
library(purrr)
files <- c("https://www.football-data.co.uk/mmz4281/1920/EC.csv",
"https://www.football-data.co.uk/mmz4281/1819/EC.csv",
"https://www.football-data.co.uk/mmz4281/1718/EC.csv",
"https://www.football-data.co.uk/mmz4281/1617/EC.csv",
"https://www.football-data.co.uk/mmz4281/1516/EC.csv",
"https://www.football-data.co.uk/mmz4281/1415/EC.csv",
"https://www.football-data.co.uk/mmz4281/1314/EC.csv",
"https://www.football-data.co.uk/mmz4281/1213/EC.csv",
"https://www.football-data.co.uk/mmz4281/1112/EC.csv",
"https://www.football-data.co.uk/mmz4281/1011/EC.csv")
data <- map_dfr(files, read_csv)
Error: Can't combine `BbAH` <character> and `BbAH` <double>.
So i the column BbAH
has different data types. But i dont need this column.
It would be cool if i can chose the columns which would be merged befor the merge run into an error because of this different data type problem.
Since you only needs those seven variables, you can read in those specific variables using fread
to avoid the issue with the BbAH
variable.
library(data.table)
library(dplyr)
library(purrr)
files <- c("https://www.football-data.co.uk/mmz4281/1920/EC.csv",
"https://www.football-data.co.uk/mmz4281/1819/EC.csv",
"https://www.football-data.co.uk/mmz4281/1718/EC.csv",
"https://www.football-data.co.uk/mmz4281/1617/EC.csv",
"https://www.football-data.co.uk/mmz4281/1516/EC.csv",
"https://www.football-data.co.uk/mmz4281/1415/EC.csv",
"https://www.football-data.co.uk/mmz4281/1314/EC.csv",
"https://www.football-data.co.uk/mmz4281/1213/EC.csv",
"https://www.football-data.co.uk/mmz4281/1112/EC.csv",
"https://www.football-data.co.uk/mmz4281/1011/EC.csv")
# Identify columns you need
myColumns = c("Date","Time","HomeTeam","AwayTeam","FTHG","FTAG","FTR")
# Modified function found in https://stackoverflow.com/a/51348578/8535855
# takes a filename and a vector of columns as input
fread_allfiles <- function(file, columns){
x <- fread(file, select = columns) %>%
select(everything()) #
return(x)
}
df_all <- files %>%
map_df(~ fread_allfiles(.,myColumns))
head(df_all)
which produces the following format:
Date Time HomeTeam AwayTeam FTHG FTAG FTR 1: 03/08/2019 12:30 Stockport Maidenhead 0 1 A 2: 03/08/2019 15:00 Aldershot Fylde 1 2 A 3: 03/08/2019 15:00 Barnet Yeovil 1 0 H 4: 03/08/2019 15:00 Chesterfield Dover Athletic 1 2 A 5: 03/08/2019 15:00 Chorley Bromley 0 0 D 6: 03/08/2019 15:00 Dag and Red Woking 0 2 A
You can then reformat the Date
and Time
columns if needed. It looks like on the first file has any values for Time
? So the rest are filled in as NA
> str(df_all)
Classes ‘data.table’ and 'data.frame': 5429 obs. of 7 variables:
$ Date : chr "03/08/2019" "03/08/2019" "03/08/2019" "03/08/2019" ...
$ Time : chr "12:30" "15:00" "15:00" "15:00" ...
$ HomeTeam: chr "Stockport" "Aldershot" "Barnet" "Chesterfield" ...
$ AwayTeam: chr "Maidenhead" "Fylde" "Yeovil" "Dover Athletic" ...
$ FTHG : int 0 1 1 1 0 0 1 1 2 1 ...
$ FTAG : int 1 2 0 2 0 2 0 4 2 3 ...
$ FTR : chr "A" "A" "H" "A" ...
- attr(*, ".internal.selfref")=<externalptr>