
readLines equivalent when using Azure Data Lakes and R Server together


Using R Server, I want to simply read raw text (like readLines in base) from an Azure Data Lake. I can connect and get data like so:

library(RevoScaleR)

rxSetComputeContext("local")

oAuth <- rxOAuthParameters(params)
hdFS <- RxHdfsFileSystem(params)

file1 <- RxTextData("/path/to/file.txt", fileSystem = hdFS)

RxTextData doesn't actually fetch the data when that line is executed; it works more like a symbolic link. When you run something like:

rxSummary(~. , data=file1)

Then the data is retrieved from the data lake. However, it is always read in and treated as a delimited file. I want to either:

  1. Download the file and store it locally with R code (preferably not).
  2. Use some sort of readLines equivalent to get the data from the data lake, but read it in 'raw' so that I can do my own data quality checks.

Does this functionality exist yet? If so, how is this done?

EDIT: I have also tried:

returnDataFrame = FALSE

inside RxTextData. This returns a list. But as I've stated, the data isn't read in immediately from the data lake until I run something like rxSummary, which then attempts to read it as a regular file.

Context: I have a "bad" CSV file containing line feeds inside double quotes. This causes RxTextData to break. However, my script detects these occurrences and fixes them accordingly. Therefore, I don't want RevoScaleR to read in the data and try to interpret the delimiters.
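To make the "line feeds inside double quotes" problem concrete, here is a minimal base R sketch of one way such records could be repaired once the raw lines are available. The helper name `join_quoted_lines` is hypothetical and is not the script mentioned above; it assumes a record is complete once its double quotes balance, and that escaped quotes are written as `""`.

```r
# Hypothetical helper: merge physical lines until the double quotes balance.
# Assumption: an odd number of quotes means the record continues on the next line.
join_quoted_lines <- function(lines, sep = " ") {
  out <- character(0)
  buffer <- NULL
  for (line in lines) {
    # Append the current physical line to the record being assembled
    buffer <- if (is.null(buffer)) line else paste(buffer, line, sep = sep)
    # Count the double quotes seen so far in this record
    n_quotes <- lengths(regmatches(buffer, gregexpr('"', buffer, fixed = TRUE)))
    if (n_quotes %% 2 == 0) {
      out <- c(out, buffer)
      buffer <- NULL
    }
  }
  # A record with an unterminated quote is kept as-is rather than dropped
  if (!is.null(buffer)) out <- c(out, buffer)
  out
}

broken <- c('id,comment', '1,"first line', 'second line"', '2,"ok"')
fixed <- join_quoted_lines(broken)
```

Here the embedded line feed is replaced by a space; pass a different `sep` if the break should be preserved in some other form.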


Solution

  • I found a method of doing this by calling the Azure Data Lake Store REST API (adapted from a demo from Hadley Wickham's httr package on GitHub):

    library(httpuv)
    library(httr)
    library(data.table)  # provides fread(), used in the last step
    
    # 1. Insert the app name ----
    app_name <- 'Any name'
    
    # 2. Insert the client Id ----
    client_id <- 'clientId'
    
    # 3. API resource URI ----
    resource_uri <- 'https://management.core.windows.net/'
    
    # 4. Obtain OAuth2 endpoint settings for azure. ----
    azure_endpoint <- oauth_endpoint(
        authorize = "https://login.windows.net/<tenantId>/oauth2/authorize",
        access = "https://login.windows.net/<tenantId>/oauth2/token"
        )
    
    # 5. Create the app instance ----
    myapp <- oauth_app(
      appname = app_name,
      key = client_id,
      secret = NULL
      )
    
    # 6. Get the token ----
    mytoken <- oauth2.0_token(
        azure_endpoint, 
        myapp,
        user_params = list(resource = resource_uri),
        use_oob = FALSE,
        as_header = TRUE,
        cache = FALSE
        )
    
    # 7. Get the file. --------------------------------------------------------
    test <- content(GET(
          url = "https://accountName.azuredatalakestore.net/webhdfs/v1/<PATH>?op=OPEN",
          add_headers(
            Authorization = paste("Bearer", mytoken$credentials$access_token),
            `Content-Type` = "application/json"
            )
      )) ## Returns as a binary body.
    
    df <- fread(readBin(test, "character")) ## use readBin to convert to text.
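If you want something closer to readLines (unparsed lines rather than a parsed table), the raw body can be split into lines in base R before anything interprets the delimiters. This is a minimal sketch: `raw_to_lines` is a hypothetical helper, the UTF-8 encoding is an assumption about the file, and the `body` vector below is a stand-in for the raw vector returned by `content(GET(...))` above.

```r
# Hypothetical helper: turn a raw response body into a character vector of lines.
# Assumption: the file is UTF-8 text and contains no embedded nul bytes.
raw_to_lines <- function(body, encoding = "UTF-8") {
  txt <- rawToChar(body)
  Encoding(txt) <- encoding
  # Split on LF or CRLF line endings
  strsplit(txt, "\r?\n")[[1]]
}

# Stand-in for the binary body returned by the GET request
body <- charToRaw("id,comment\n1,\"first line\nsecond line\"\n")
lines <- raw_to_lines(body)
```

The result is a plain character vector, one element per physical line, so data quality checks can run on it before the text is handed to fread or written back out.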