Using R Server, I want to simply read raw text (like readLines in base) from an Azure Data Lake. I can connect and get data like so:
library(RevoScaleR)
rxSetComputeContext("local")
oAuth <- rxOAuthParameters(params)
hdFS <- RxHdfsFileSystem(params)
file1 <- RxTextData("/path/to/file.txt", fileSystem = hdFS)
RxTextData doesn't actually fetch the data when that line is executed; it works more like a symbolic link. When you run something like:
rxSummary(~. , data=file1)
Then the data is retrieved from the data lake. However, it is always read in and treated as a delimited file. I want a readLines equivalent that gets the data from the data lake but reads it in 'raw', so that I can do my own data quality checks. Does this functionality exist yet? If so, how is this done?
EDIT: I have also tried returnDataFrame = FALSE inside RxTextData. This returns a list. But as I've stated, the data isn't read in from the data lake until I run something like rxSummary, which then attempts to read it as a regular file.
Context: I have a "bad" CSV file containing line feeds inside double quotes. This causes RxTextData to break. However, my script detects these occurrences and fixes them accordingly, so I don't want RevoScaleR to read in the data and try to interpret the delimiters.
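To make the "bad CSV" case concrete, here is a minimal sketch of one way such a repair could work once the raw lines are available. It is not the script mentioned above; the helper name repair_embedded_newlines and the choice to replace the embedded line feed with a space are assumptions for illustration. The idea is to keep joining physical lines until the number of double quotes in the record is even.

```r
# Hypothetical helper (not from the original script): merge physical lines
# that belong to one CSV record because a quoted field contains a line feed.
repair_embedded_newlines <- function(lines, replacement = " ") {
  out <- character(0)
  buffer <- NULL
  for (line in lines) {
    # Append the current physical line to any record still being assembled.
    buffer <- if (is.null(buffer)) line else paste(buffer, line, sep = replacement)
    # Count double quotes; an odd count means a quoted field is still open.
    n_quotes <- lengths(regmatches(buffer, gregexpr('"', buffer, fixed = TRUE)))
    if (n_quotes %% 2 == 0) {
      out <- c(out, buffer)
      buffer <- NULL
    }
  }
  # Keep an unterminated trailing record instead of silently dropping it.
  if (!is.null(buffer)) out <- c(out, buffer)
  out
}

raw_lines <- c('id,comment', '1,"first half', 'second half"', '2,"ok"')
fixed <- repair_embedded_newlines(raw_lines)
```

Because an escaped quote inside a field is written as two quote characters, it does not change the parity, so the even/odd check still holds for standard CSV quoting.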
I found a method of doing this by calling the Azure Data Lake Store REST API (adapted from a demo in Hadley Wickham's httr package on GitHub):
library(httpuv)
library(httr)
library(data.table) # provides fread(), used in step 7
# 1. Insert the app name ----
app_name <- 'Any name'
# 2. Insert the client Id ----
client_id <- 'clientId'
# 3. API resource URI ----
resource_uri <- 'https://management.core.windows.net/'
# 4. Obtain OAuth2 endpoint settings for azure. ----
azure_endpoint <- oauth_endpoint(
  authorize = "https://login.windows.net/<tenantId>/oauth2/authorize",
  access = "https://login.windows.net/<tenantId>/oauth2/token"
)
# 5. Create the app instance ----
myapp <- oauth_app(
  appname = app_name,
  key = client_id,
  secret = NULL
)
# 6. Get the token ----
mytoken <- oauth2.0_token(
  azure_endpoint,
  myapp,
  user_params = list(resource = resource_uri),
  use_oob = FALSE,
  as_header = TRUE,
  cache = FALSE
)
# 7. Get the file. --------------------------------------------------------
test <- content(GET(
  url = "https://accountName.azuredatalakestore.net/webhdfs/v1/<PATH>?op=OPEN",
  add_headers(
    Authorization = paste("Bearer", mytoken$credentials$access_token),
    `Content-Type` = "application/json"
  )
)) ## Returns as a binary body.
df <- fread(readBin(test, "character")) ## use readBin to convert to text.
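Since the original goal was a readLines-style result rather than a parsed table, the raw body can also be split into physical lines before any delimiter handling. The following is only a sketch: raw_to_lines is a hypothetical helper name, and it assumes the body is a raw vector of UTF-8 text with no embedded NUL bytes (rawToChar fails on those).

```r
# Hypothetical helper: convert a raw response body into a character vector
# with one element per physical line, similar to what readLines() returns.
raw_to_lines <- function(raw_body, encoding = "UTF-8") {
  text <- rawToChar(raw_body)
  Encoding(text) <- encoding  # assumption: the file is UTF-8 encoded
  # Split on LF or CRLF; strsplit() drops the empty piece after a final newline.
  strsplit(text, "\r?\n")[[1]]
}

# Stand-in for the body returned in step 7, so the sketch runs without a network call.
body <- charToRaw("a,b\r\n1,\"x\ny\"\r\n")
lines <- raw_to_lines(body)
```

With the request above, the same helper would be called as raw_to_lines(test), provided test is a raw vector. The resulting lines are left uninterpreted, so custom quality checks can run on them before the text is handed to fread or any other parser.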