Converting continuous variable to categorical with dplyrXdf

I'm trying to perform some initial exploration of some data. I am busy analysing one-ways of continuous variables by converting them to factors and calculating frequencies by bands.

I would like to do this with dplyrXdf but it doesn't seem to work the same as normal dplyr for what I'm attempting

sample_data <- RxXdfData("./data/test_set.xdf") #sample xdf for testing
as_data_frame <- rxXdfToDataFrame(sample_data) #same data as dataframe

# Calculate freq by Buildings Sum Insured band

Importing my sample data as a dataframe the below code works

buildings_ad_fr <- as_data_frame %>% 
  mutate(bd_cut = cut(BD_INSURED_VALUE, seq(from = 150000, to = 10000000,by = 5000000))) %>% 
  group_by(bd_cut) %>% 
  summarise(exposure = sum(BENEFIT_EXPOSURE, na.rm = TRUE),
            ad_pd_f = sum(ACT_AD_PD_CLAIM_COUNT)/sum(BENEFIT_EXPOSURE, na.rm = TRUE))

But I cant do the same thing using the xdf version of the data

buildings_ad_fr_xdf <- sample_data %>% 
      mutate(bd_cut = cut(BD_INSURED_VALUE, seq(from = 150000, to = 10000000,by = 5000000))) %>% 
      group_by(bd_cut) %>% 
      summarise(exposure = sum(BENEFIT_EXPOSURE, na.rm = TRUE),
                ad_pd_f = sum(ACT_AD_PD_CLAIM_COUNT)/sum(BENEFIT_EXPOSURE, na.rm = TRUE))

A workaround I can think would be to use rxDataStep to create the new column by passing through bd_cut = cut(BD_INSURED_VALUE, seq(from = 150000, to = 10000000,by = 5000000)) in the transforms argument, but it shouldn’t be necessary to have an intermediate step.

I've tried using the .rxArgs function before the group_by expression but that also doesn't seem to work

buildings_ad_fr <- sample_data %>% 
  mutate(sample_data,.rxArgs = list(transforms = list(bd_cut = cut(BD_INSURED_VALUE,
                                                                   seq(150000,
                                                                       10000000,
                                                                       5000000)))))%>%
  group_by(bd_cut) %>% 
    summarise(exposure = sum(BENEFIT_EXPOSURE, na.rm = TRUE),
            ad_pd_f = sum(ACT_AD_PD_CLAIM_COUNT)/sum(BENEFIT_EXPOSURE, na.rm = TRUE))

Both times on the xdf file it gives the error Error in summarise.RxFileData(., exposure = sum(BENEFIT_EXPOSURE, na.rm = TRUE),: with xdf tbls only works with named variables, not expressions

Now I know this package can factorise variables but I am not sure how to use it to split up a continuous variable

Does anyone know how to do this?

Solution

The mutate should be fine. The summarise is different for Xdf files:

Internally summarise will run rxCube or rxSummary by default, which automatically remove NAs. You don't need na.rm=TRUE.
You can't summarise on an expression. The solution is to run the summarise and then compute the expression:

xdf %>%
    group_by(*) %>%
    summarise(expos=sum(expos), pd=sum(clms)) %>%
    mutate(pd=pd/expos)

I've also just updated dplyXdf to 0.10.0 beta, which adds support for HDFS/Spark and dplyr 0.7 along with several nifty utility functions. If you're not using it already, you might want to check it out. The formal release should happen when the next MRS version comes out.