I'm doing some initial exploration of some data: one-way analyses of continuous variables, where I convert each variable to a factor and calculate frequencies by band.
I would like to do this with dplyrXdf, but it doesn't seem to work the same way as normal dplyr for what I'm attempting.
sample_data <- RxXdfData("./data/test_set.xdf") #sample xdf for testing
as_data_frame <- rxXdfToDataFrame(sample_data) #same data as dataframe
With my sample data imported as a data frame, the code below works:
# Calculate frequency by Buildings Sum Insured band
buildings_ad_fr <- as_data_frame %>%
    mutate(bd_cut = cut(BD_INSURED_VALUE, seq(from = 150000, to = 10000000, by = 5000000))) %>%
    group_by(bd_cut) %>%
    summarise(exposure = sum(BENEFIT_EXPOSURE, na.rm = TRUE),
              ad_pd_f = sum(ACT_AD_PD_CLAIM_COUNT) / sum(BENEFIT_EXPOSURE, na.rm = TRUE))
But I can't do the same thing using the xdf version of the data:
buildings_ad_fr_xdf <- sample_data %>%
    mutate(bd_cut = cut(BD_INSURED_VALUE, seq(from = 150000, to = 10000000, by = 5000000))) %>%
    group_by(bd_cut) %>%
    summarise(exposure = sum(BENEFIT_EXPOSURE, na.rm = TRUE),
              ad_pd_f = sum(ACT_AD_PD_CLAIM_COUNT) / sum(BENEFIT_EXPOSURE, na.rm = TRUE))
One workaround I can think of would be to use rxDataStep to create the new column, by passing bd_cut = cut(BD_INSURED_VALUE, seq(from = 150000, to = 10000000, by = 5000000)) in the transforms argument, but an intermediate step shouldn't be necessary.
I've also tried using the .rxArgs argument before the group_by expression, but that doesn't seem to work either:
buildings_ad_fr <- sample_data %>%
    mutate(sample_data, .rxArgs = list(transforms = list(
        bd_cut = cut(BD_INSURED_VALUE, seq(150000, 10000000, 5000000))))) %>%
    group_by(bd_cut) %>%
    summarise(exposure = sum(BENEFIT_EXPOSURE, na.rm = TRUE),
              ad_pd_f = sum(ACT_AD_PD_CLAIM_COUNT) / sum(BENEFIT_EXPOSURE, na.rm = TRUE))
Both times, on the xdf file, it gives the error:
Error in summarise.RxFileData(., exposure = sum(BENEFIT_EXPOSURE, na.rm = TRUE),: with xdf tbls only works with named variables, not expressions
Now, I know this package can factorise variables, but I am not sure how to use it to split up a continuous variable.
Does anyone know how to do this?
The mutate should be fine. The summarise is different for Xdf files:
Internally summarise will run rxCube or rxSummary by default, which automatically remove NAs. You don't need na.rm=TRUE.
You can't summarise on an expression. The solution is to run the summarise and then compute the expression:
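sample_data %>%
    mutate(bd_cut = cut(BD_INSURED_VALUE, seq(from = 150000, to = 10000000, by = 5000000))) %>%
    group_by(bd_cut) %>%
    # summarise named variables only; na.rm is not needed here
    summarise(exposure = sum(BENEFIT_EXPOSURE), ad_pd_f = sum(ACT_AD_PD_CLAIM_COUNT)) %>%
    # then compute the frequency from the summarised columns
    mutate(ad_pd_f = ad_pd_f / exposure)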
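If you want to try the sum-first, divide-afterwards pattern without an xdf file, here is a minimal base R sketch on a toy data frame. It is only an analogy: it uses aggregate() rather than dplyrXdf, the column names are borrowed from the question, and the values and break points are invented for illustration.

```r
# Hypothetical toy data: column names follow the question, values are made up.
toy <- data.frame(
  BD_INSURED_VALUE      = c(200000, 300000, 2000000, 6000000, 7000000),
  BENEFIT_EXPOSURE      = c(1.0, 0.5, 1.0, 0.5, 1.0),
  ACT_AD_PD_CLAIM_COUNT = c(0, 1, 1, 0, 1)
)

# Band the continuous variable (break points chosen only for this toy data).
toy$bd_cut <- cut(toy$BD_INSURED_VALUE,
                  breaks = c(150000, 1000000, 5000000, 10000000))

# Step 1: aggregate plain, named columns per band (no expressions here).
sums <- aggregate(toy[c("BENEFIT_EXPOSURE", "ACT_AD_PD_CLAIM_COUNT")],
                  by = list(bd_cut = toy$bd_cut), FUN = sum)
names(sums) <- c("bd_cut", "exposure", "claims")

# Step 2: derive the frequency from the aggregated columns.
sums$ad_pd_f <- sums$claims / sums$exposure

print(sums)
```

Because the frequency is a ratio of two sums, dividing after the aggregation gives the same per-band result as computing the ratio inside a single summarise call on a data frame.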
I've also just updated dplyrXdf to 0.10.0 beta, which adds support for HDFS/Spark and dplyr 0.7, along with several nifty utility functions. If you're not using it already, you might want to check it out. The formal release should happen when the next MRS version comes out.