I'm trying to replicate what the author of this paper has achieved with the publicly available Medicare dataset.
In summary, the author groups medical provider claims by the providers ID, their taxonomy and the HCPCS (codes of the procedures they performed) by most frequent to least frequent, see below image:
Using the code below I have been able to recreate the top left hand table and also the bottom left table (I don't think it's necessary) but I don't know how to group each providers HCPCS codes by highest frequency to lowest frequency in preparation for feeding it into word2vec to train an embedding model.
If I could get some help preparing the data ready for word2vec training I would be very grateful.
library(httr)
library(jsonlite)
library(tidyverse)
# CONNECT TO CMS DATA
res <- GET("https://data.cms.gov/data-api/v1/dataset/5fccd951-9538-48a7-9075-6f02b9867868/data?size=5000")
# CONVERT TO DATA FRAME
data = fromJSON(rawToChar(res$content))
# GROUPING AND COUNTING OCCURANCES OF HCPCS PER PROVIDER ID
providerHCPCS <- data %>%
group_by(Rndrng_NPI,Rndrng_Prvdr_Type,HCPCS_Cd) %>%
count(HCPCS_Cd, name = "Line_Srvc_Cnt") %>%
group_by(Rndrng_NPI) %>%
arrange(desc(Line_Srvc_Cnt), .by_group = TRUE)
Is this what you want as a result?
table2 <- providerHCPCS %>% group_by(Rndrng_NPI, Rndrng_Prvdr_Type) %>% summarise(HCPCS_sequence = (paste(HCPCS_Cd, collapse=", ")))