r cookies https web-scraping httr

R httr post-authentication download works in interactive mode but fails in function


the code below works fine in interactive mode but fails when used in a function. it's pretty simple: two authentication POST commands followed by the data download. my goal is to get this working inside a function, not just in interactive mode.

this question is sort of a sequel to an earlier question. icpsr recently updated their website. the minimal reproducible example below requires a free account, available at

https://www.icpsr.umich.edu/rpxlogin?path=ICPSR&request_uri=https%3a%2f%2fwww.icpsr.umich.edu%2ficpsrweb%2findex.jsp

i tried adding Sys.sleep(1) and various httr::GET/httr::POST calls but nothing worked.

my_download <-
    function( your_email , your_password ){

        values <-
            list(
                agree = "yes",
                path = "ICPSR" ,
                study = "21600" ,
                ds = "" ,
                bundle = "rdata",
                dups = "yes",
                email=your_email,
                password=your_password
            )


        httr::POST("https://www.icpsr.umich.edu/cgi-bin/terms", body = values)
        httr::POST("https://www.icpsr.umich.edu/rpxlogin", body = values)

        tf <- tempfile()
        httr::GET( 
            "https://www.icpsr.umich.edu/cgi-bin/bob/zipcart2" , 
            query = values , 
            httr::write_disk( tf , overwrite = TRUE ) , 
            httr::progress()
        )

    }

# fails 
my_download( "email@address.com" , "some_password" )

# stepping through works
debug( my_download )
my_download( "email@address.com" , "some_password" )

EDIT: the failure simply downloads the page below, as if not logged in, instead of the dataset, so it's losing the authentication for some reason. if you are logged in to icpsr, use private browsing to see the page:

https://www.icpsr.umich.edu/cgi-bin/bob/zipcart2?study=21600&ds=1&bundle=rdata&path=ICPSR

thanks!


Solution

  • This sort of thing can happen because of the state (such as cookies) that the httr package stores in the handle for each host (see ?handle).

    In this particular case it remains unclear what exactly makes it work, but one strategy is to include a GET request to https://www.icpsr.umich.edu/cgi-bin/bob/ prior to authenticating and requesting the data. For example,

    my_download <-
        function( your_email , your_password ){
            ## for some reason this is required ...
            httr::GET("https://www.icpsr.umich.edu/cgi-bin/bob/")
            values <-
                list(
                    agree = "yes",
                    path = "ICPSR" ,
                    study = "21600" ,
                    ds = "" ,
                    bundle = "rdata",
                    dups = "yes",
                    email=your_email,
                    password=your_password
                )
            httr::POST("https://www.icpsr.umich.edu/rpxlogin", body = values)
            httr::POST("https://www.icpsr.umich.edu/cgi-bin/terms", body = values)
            tf <- tempfile()
            httr::GET( 
                "https://www.icpsr.umich.edu/cgi-bin/bob/zipcart2" , 
                query = values , 
                httr::write_disk( tf , overwrite = TRUE ) , 
                httr::progress()
            )
        }
    

    appears to work correctly, though it remains unclear what the GET request to https://www.icpsr.umich.edu/cgi-bin/bob/ does exactly or why it is needed.
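
Since the explanation above points at the state httr keeps in its handles, one way to make that state visible is to create a handle yourself and pass it to every request, then check whether the final response looks like a web page rather than a dataset. The following is only a sketch and has not been verified against ICPSR: the names my_download_with_handle, h and resp are made up for illustration, and httr already reuses one pooled handle per host, so this may not change the behavior at all. It relies on httr::handle(), the handle argument of httr::GET()/httr::POST(), and httr::http_type().

```r
# Sketch only: the same requests as the answer above, sharing one explicit
# handle. my_download_with_handle, h and resp are hypothetical names.
my_download_with_handle <-
    function( your_email , your_password ){

        # a single handle for the whole session, so cookies set by one
        # request are sent along with the next one
        h <- httr::handle( "https://www.icpsr.umich.edu" )

        values <-
            list(
                agree = "yes" ,
                path = "ICPSR" ,
                study = "21600" ,
                ds = "" ,
                bundle = "rdata" ,
                dups = "yes" ,
                email = your_email ,
                password = your_password
            )

        # same sequence as the answer: initial GET, log in, accept terms
        httr::GET( "https://www.icpsr.umich.edu/cgi-bin/bob/" , handle = h )
        httr::POST( "https://www.icpsr.umich.edu/rpxlogin" , body = values , handle = h )
        httr::POST( "https://www.icpsr.umich.edu/cgi-bin/terms" , body = values , handle = h )

        tf <- tempfile()
        resp <-
            httr::GET(
                "https://www.icpsr.umich.edu/cgi-bin/bob/zipcart2" ,
                query = values ,
                httr::write_disk( tf , overwrite = TRUE ) ,
                httr::progress() ,
                handle = h
            )

        # an HTML response suggests the login page came back instead of the data
        if( grepl( "html" , httr::http_type( resp ) , fixed = TRUE ) )
            warning( "received an HTML page; the authentication may have been lost" )

        # return the path of the downloaded file without printing it
        invisible( tf )
    }
```

Calling tf <- my_download_with_handle( "email@address.com" , "some_password" ) would then return the path of the downloaded file, with a warning if the server appears to have sent the not-logged-in page instead.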