My goal is to read zip files directly from the web. Each zip file contains multiple .txt files; in my example I am trying to retrieve the data from the routes.txt file.
My code is the following:
library(readr)
library(dplyr)

# links
tt_url <- c("",

# download zip files
f_get_data <- function(i, data){
  url <- tt_url[i]
  zip_file <- tempfile(fileext = ".zip")
  download.file(url, zip_file, mode = "wb")
  # unzip() extracts `data` into the working directory and returns its path
  df <- read_delim(unzip(zip_file, files = data), delim = ",") %>%
    mutate(year = i + 2015)
  df
}

test_1 <- f_get_data(1, "routes.txt")
test_2 <- f_get_data(2, "routes.txt")
When I call f_get_data(1, "routes.txt") the first time, the retrieved df, test_1, is correct.
# A tibble: 6 × 8
  route_id    agency_id route_short_name route_long_name route_type route_color route_text_color  year
  <chr>       <lgl>     <chr>            <lgl>                <dbl> <lgl>       <lgl>            <dbl>
1 11-21-j16-1 NA        021              NA                       3 NA          NA                2016
2 11-22-j16-1 NA        022              NA                       3 NA          NA                2016
3 16-22-j16-1 NA        022              NA                       3 NA          NA                2016
4 11-25-j16-1 NA        025              NA                       3 NA          NA                2016
5 11-41-j16-1 NA        041              NA                       3 NA          NA                2016
6 11-42-j16-1 NA        042              NA                       3 NA          NA                2016
If I move on to the next period with f_get_data(2, "routes.txt"), the retrieved df, test_2, is also correct.
BUT after the second call has completed, the first df, test_1, is corrupted:
> head(test_2)
# A tibble: 6 × 7
  route_id    agency_id route_short_name route_long_name route_desc route_type  year
  <chr>       <chr>     <chr>            <lgl>           <chr>           <dbl> <dbl>
1 79-0-j17-1  881       00               NA              Bus               700  2017
2 11-61-j17-1 7031      061              NA              Bus               700  2017
3 11-62-j17-1 7031      062              NA              Bus               700  2017
4 24-64-j17-1 801       064              NA              Bus               700  2017
5 24-65-j17-1 801       065              NA              Bus               700  2017
6 24-66-j17-1 801       066              NA              Bus               700  2017
> head(test_1)
# A tibble: 6 × 8
  route_id             agency_id route_short_name route_long_name route_type route_color route_text_color  year
  <chr>                <lgl>     <chr>            <lgl>                <dbl> <lgl>       <lgl>            <dbl>
1 ",\"00\",\"\",\"Bus" NA        "00\"\r\n"       NA                      NA NA          NA                2016
2 "7031\",\"061\",\""  NA        "us\",\""        NA                      NA NA          NA                2016
3 "7-1\",\"7031\",\""  NA        ",\"\",\""       NA                      NA NA          NA                2016
4 "-64-j17-1\",\"8"    NA        "064"            NA                      NA NA          NA                2016
5 "\r\n\"24-65-j17-"   NA        "801\","         NA                      NA NA          NA                2016
6 "700\r\n24-66"       NA        "-1\",\""        NA                      NA NA          NA                2016
Does anyone know why and especially how this happens? In my opinion, after I have assigned the retrieved data of my function to a certain data frame, it should be independent of the later use of my function.
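The problem is the default behavior of the read_delim() function (lazy reading was the default in readr 2.0; later releases switched the default to lazy = FALSE). To improve performance, the data is loaded lazily, meaning values are only read from disk when they are actually needed.
So the return value of f_get_data() is really just a pointer to the data on disk. Here it points to the extracted routes.txt, which unzip() writes to the working directory and overwrites on every call to the function. After the second call, test_1 therefore reads its values from the 2017 file.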
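To see which file is shared between the two calls, a quick base R check may help. This is only a sketch; `zip_a` and `zip_b` are illustrative names, not part of the original code. It suggests that the downloaded zip files get distinct paths, while the extracted file lands in the same place each time.

```r
# tempfile() generates a fresh path on every call, so the downloaded
# zip files themselves do not collide.
zip_a <- tempfile(fileext = ".zip")
zip_b <- tempfile(fileext = ".zip")
zip_a != zip_b
# [1] TRUE

# unzip() extracts into `exdir`, which defaults to the working directory,
# and it overwrites existing files by default. Two calls that extract
# "routes.txt" therefore write to the same path.
formals(utils::unzip)$exdir
# [1] "."
formals(utils::unzip)$overwrite
# [1] TRUE
```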
To solve this, set lazy to FALSE in the read_delim() function call, so that the data is read into memory immediately:
  df <- read_delim(unzip(zip_file, files = data), delim = ",", lazy = FALSE) %>%
    mutate(year = i + 2015)
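As an extra safeguard, each call could also extract into its own directory, so that no two results ever refer to the same file on disk. The following is a minimal sketch under some assumptions: the URL and the year are passed in directly (the parameter names `url` and `year_value` and the variable `out_dir` are hypothetical, chosen for illustration), and the readr version in use supports the `lazy` argument.

```r
library(readr)
library(dplyr)

# Sketch of a variant of f_get_data(); `url` and `year_value` are
# illustrative parameters, not part of the original function.
f_get_data <- function(url, data, year_value) {
  zip_file <- tempfile(fileext = ".zip")
  download.file(url, zip_file, mode = "wb")

  # A unique extraction directory per call; unzip() creates it if needed
  # and returns the path(s) of the extracted file(s).
  out_dir <- tempfile("unzipped_")
  extracted <- unzip(zip_file, files = data, exdir = out_dir)

  # lazy = FALSE reads everything into memory right away, so the result
  # no longer depends on the file on disk.
  read_delim(extracted, delim = ",", lazy = FALSE) %>%
    mutate(year = year_value)
}
```

A call would then look like f_get_data(some_url, "routes.txt", 2016), where some_url stands for one of the zip links. With lazy = FALSE alone the original problem should already disappear; the separate directory just avoids leaving a shared routes.txt in the working directory.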