The purpose
For university research, I am trying to process clinical study data that is publicly available here.
For reproducibility, I would like to work directly with the downloaded JSON or XML files rather than retrieve the data via the web API (which has been described: how-to-get-data-out-of-nested-xml-structure).
Update 1: The structure of the JSON file is published here
Update 2: The structure of the XML file is published here
I think tidyjson::read_json and tidyjson::spread_all do the trick! See the answer section.
What I need
For my workflow, I need to convert the data to data.frames (tidy data.frames would be even better). I prefer JSON; however, I would also be very glad of a solution for the XML format.
Test data
A nested list that I generated from one of the downloaded JSON files with jsonlite::fromJSON("NCT0455805.json"):
test <- list(FullStudy = list(Rank = 254369L, Study = list(ProtocolSection = list(
IdentificationModule = list(NCTId = "NCT01455805", OrgStudyIdInfo = list(
OrgStudyId = "SS2011UK"), Organization = list(OrgFullName = "Spinal Simplicity LLC",
OrgClass = "INDUSTRY"), BriefTitle = "Minuteman Spinal Fusion Implant Versus Surgical Decompression for Lumbar Spinal Stenosis",
OfficialTitle = "Efficacy and Quality of Life Following Treatment of Lumbar Spinal Stenosis, Spondylolisthesis or Degenerative Disc Disease With the Minuteman Interspinous Interlaminar Fusion Implant Versus Surgical Decompression"),
StatusModule = list(StatusVerifiedDate = "October 2020",
OverallStatus = "Active, not recruiting", ExpandedAccessInfo = list(
HasExpandedAccess = "No"), StartDateStruct = list(
StartDate = "June 2012"), PrimaryCompletionDateStruct = list(
PrimaryCompletionDate = "March 2024", PrimaryCompletionDateType = "Anticipated"),
CompletionDateStruct = list(CompletionDate = "March 2024",
CompletionDateType = "Anticipated"), StudyFirstSubmitDate = "October 13, 2011",
StudyFirstSubmitQCDate = "October 18, 2011", StudyFirstPostDateStruct = list(
StudyFirstPostDate = "October 20, 2011", StudyFirstPostDateType = "Estimate"),
LastUpdateSubmitDate = "October 22, 2020", LastUpdatePostDateStruct = list(
LastUpdatePostDate = "October 26, 2020", LastUpdatePostDateType = "Actual")),
SponsorCollaboratorsModule = list(ResponsibleParty = list(
ResponsiblePartyType = "Sponsor"), LeadSponsor = list(
LeadSponsorName = "Spinal Simplicity LLC", LeadSponsorClass = "INDUSTRY"),
CollaboratorList = list(Collaborator = list(list(CollaboratorName = "The Leeds Teaching Hospitals NHS Trust",
CollaboratorClass = "OTHER")))), OversightModule = list(
OversightHasDMC = "Yes"), DescriptionModule = list(BriefSummary = "Lumbar spinal stenosis (LSS), is a common disorder of narrowing of the spinal canal in the lower part of the back. This causes discomfort in the legs when standing or walking because of pressure on the spinal nerves.There are several treatment options for LSS including physiotherapy, lumbar surgical decompression procedures such as laminectomy, Foraminotomy, Discectomy and more recently devices for interspinous distraction such as the XSTOP® and from May 2011 Minuteman\231.\n\nSurgical decompression for LSS involves the removal of excess bone, ligament, and soft-tissue allowing more room for the nerves. The operation is usually preformed under general anaesthetic and with an average stay in hospital for 2-3 nights. Whereas the Minuteman\231 implant is preformed as a day case under local or general anaesthetic and involves implanting the device into the space between two back bones to relieve pressure on the nerves and, therefore, pain in the legs.\n\nThis is a multi centred (four sites) randomised controlled trial with a total sample of 50 participants after obtaining their informed consent. Participants will attend the pain clinic at the Hospitals for a baseline visit where they will be randomised with a ratio of 1:1 to receive either the Minuteman\231 Interspinous interlaminar fusion Implant or standard surgical decompression for the treatment of lumbar spinal stenosis (LSS). Following randomisation arrangements will be made for the participant to receive the randomised treatment. If allocated to Minuteman\231 Implant, the treatment will be conducted by the Pain Specialist identified at the site. If allocated to surgical decompression, the treatment will be conducted by the neuro/spinal-surgeon identified at the site. Participates will be followed up regularly for 60 months post implant to assess clinical efficacy, safety, participants function and quality of life of each treatment.",
DetailedDescription = "This is a prospective randomised study monitoring patients for up to 5 years post treatment. Only patients who have an appropriately diagnosed Lumbar Spinal Stenosis with intermittent claudication with/without low back pain, with no adequate symptomatic relief after at least 6 months of conservative treatment will be asked to give consent to be involved. Potential participants will be approached for enrollment 17days before the planned baseline visit. Patients will be given oral and written information about the trial as well as the patient information leaflet for the study. If informed consent is given their participation in this study will be for a maximum of 5 years."),
ConditionsModule = list(ConditionList = list(Condition = c("Lumbar Spinal Stenosis",
"Spondylolisthesis", "Degenerative Disc Disease"))), DesignModule = list(
StudyType = "Interventional", PhaseList = list(Phase = "Not Applicable"),
DesignInfo = list(DesignAllocation = "Randomized", DesignInterventionModel = "Parallel Assignment",
DesignPrimaryPurpose = "Treatment", DesignMaskingInfo = list(
DesignMasking = "None (Open Label)")), EnrollmentInfo = list(
EnrollmentCount = "50", EnrollmentType = "Anticipated")),
ArmsInterventionsModule = list(ArmGroupList = list(ArmGroup = list(
list(ArmGroupLabel = "Minuteman Fusion Implant", ArmGroupType = "Active Comparator",
ArmGroupDescription = "Minuteman\231 interspinous interlaminar fusion Implant (interspinous interlaminar fusion device) which gained CE Mark approval in May 2011",
ArmGroupInterventionList = list(ArmGroupInterventionName = "Device: Minuteman Fusion Implant")),
list(ArmGroupLabel = "Surgical decompression", ArmGroupType = "Other",
ArmGroupDescription = "Surgical decompression refers to the following operations Laminectomy, Foraminotomy, Discectomy or any other surgical procedure that the clinician feels is relevant for the decompression of lumbar spinal stenosis.",
ArmGroupInterventionList = list(ArmGroupInterventionName = "Procedure: surgical decompression")))),
InterventionList = list(Intervention = list(list(InterventionType = "Device",
InterventionName = "Minuteman Fusion Implant", InterventionDescription = "The Minuteman\231 interspinous interlaminar fusion device consists of a central threaded portion that has a two-part wing plate hinged near its proximal end, with spikes on the extended distal end of the wing plate, and a multi-spiked end cap plate that is located at the distal end of the device and is retained and tightened in place with a locking hex nut. Compression between the spiked wing plate and the spiked end cap plate serves to fix the spinous processes in place and to facilitate fusion, together with bone graft fusion material placed within the device. The threaded external body has been designed to provide ease of distraction and insertion via a minimally invasive surgical procedure.",
InterventionArmGroupLabelList = list(InterventionArmGroupLabel = "Minuteman Fusion Implant"),
InterventionOtherNameList = list(InterventionOtherName = "The Minuteman\231 interspinous interlaminar fusion device")),
list(InterventionType = "Procedure", InterventionName = "surgical decompression",
InterventionDescription = "Surgical decompression refers to the following operations Laminectomy, Foraminotomy, Discectomy or any other surgical procedure that the clinician feels is relevant for the decompression of lumbar spinal stenosis",
InterventionArmGroupLabelList = list(InterventionArmGroupLabel = "Surgical decompression"))))),
OutcomesModule = list(PrimaryOutcomeList = list(PrimaryOutcome = list(
list(PrimaryOutcomeMeasure = "Change from baseline of clinical efficacy up to 60 months post procedure",
PrimaryOutcomeDescription = "These include:\n\nVisual Analogue Scale (VAS) pain scores Leg Pain\nVisual Analogue Scale (VAS) pain scores Back Pain\nOswestry Disability Index (ODI)\nZurich Claudication Questionnaire (ZCQ)\nAssessment of Physical Function via distance walked in 5 minutes and number of repetitions of sitting to standing in 1 minute.\n\nThe main outcome will be a comparison between treatment groups based on the change from baseline at each follow-up visit for each of the measures listed above.",
PrimaryOutcomeTimeFrame = "8 weeks and up to 60 months post procedure."))),
SecondaryOutcomeList = list(SecondaryOutcome = list(list(
SecondaryOutcomeMeasure = "measures of quality of life",
SecondaryOutcomeDescription = "These include:\n\nChange in functional status questionnaire from baseline\nParticipants global impression of change from baseline (PGIC)\nClinician's global Impression of change from baseline (CGIC)\nEmployment status",
SecondaryOutcomeTimeFrame = "8 weeks and up to 60 months post procedure."),
list(SecondaryOutcomeMeasure = "Adverse events related to device and procedure",
SecondaryOutcomeTimeFrame = "safety to be assessed at 8 weeks and up to 60 months post procedure.")))),
EligibilityModule = list(EligibilityCriteria = "Inclusion Criteria:\n\nIs male or a non pregnant female aged 18years or older\nBMI = 35kg/m2\nHas chronic leg pain with or without back pain of greater than 6 months duration,which is partially or completely relieved by either sitting or adopting a flexed posture and who are suitable in the clinicians opinion for posterior lumbar surgery\nPre-operative ODI score = 20%\nPre-operative ZCQ Physical Function Domain =2\nPre-operative VAS Leg pain score = 4\nHas completed at least 6 months of conservative treatment without obtaining adequate symptomatic relief or has worsening neurological symptoms.\nHas degenerative changes at 1 or 2 levels confirmed by MRI or CT Myelogram within the last 12 months) with one or more of the following:\nLumbar spinal stenosis with intermittent neurogenic claudication\nDegeneration of the disc (as evidenced by imaging on MRI)\nAnnular thickening\nDegenerative Spondylolisthesis = Meyerding Grade 1\nThickening of ligamentum flavum\n\nExclusion Criteria:\n\nFixed motor deficit\nHas undergone previous lumbar spinal surgery\nIs unwilling or unable to give consent or adhere to the follow up schedule\nHas active infection or metastatic disease\nHas spondylolisthesis > grade 1\nHas neurogenic bladder or bowel disease\nHas a history of Osteopenia and or Osteoporosis. Evaluation of possible Osteopenia and or Osteoporosis will be conducted via a bone density scan prior to randomisation if ANY of the Bone Mass Evaluation criteria is met\nPatients who are not deemed fit for anaesthesia/major surgery due to underlying medical condition",
HealthyVolunteers = "No", Gender = "All", MinimumAge = "18 Years",
StdAgeList = list(StdAge = c("Adult", "Older Adult"))),
ContactsLocationsModule = list(OverallOfficialList = list(
OverallOfficial = list(list(OverallOfficialName = "Ganesan Baranidharan, Dr",
OverallOfficialAffiliation = "Leeds Teaching Hospitals NHS Trust",
OverallOfficialRole = "Principal Investigator"))),
LocationList = list(Location = list(list(LocationFacility = "Taunton & Somerset NHS Foundation Trust of Musgrove Park Hospital",
LocationCity = "Taunton", LocationState = "Somerset",
LocationZip = "TA1 5DA", LocationCountry = "United Kingdom"),
list(LocationFacility = "The Ipswich Hospital NHS Trust",
LocationCity = "Ipswich", LocationState = "Suffolk",
LocationZip = "IP4 5PD", LocationCountry = "United Kingdom"),
list(LocationFacility = "Pain and Interventional Neuromodulation Research Group, Pain Management Dept, Seacroft Hospital, Leeds Teaching Hospitals NHS Trust",
LocationCity = "Leeds", LocationState = "West Yorkshire",
LocationZip = "LS14 6UH", LocationCountry = "United Kingdom"),
list(LocationFacility = "The Dudley Group NHS Foundation Trust, Russell Hall Hospital",
LocationCity = "Birmingham", LocationZip = "DY1 2HQ",
LocationCountry = "United Kingdom"))))), DerivedSection = list(
MiscInfoModule = list(VersionHolder = "February 26, 2021"),
ConditionBrowseModule = list(ConditionMeshList = list(ConditionMesh = list(
list(ConditionMeshId = "D000013130", ConditionMeshTerm = "Spinal Stenosis"),
list(ConditionMeshId = "D000055959", ConditionMeshTerm = "Intervertebral Disc Degeneration"),
list(ConditionMeshId = "D000013168", ConditionMeshTerm = "Spondylolisthesis"),
list(ConditionMeshId = "D000003251", ConditionMeshTerm = "Constriction, Pathologic"))),
ConditionAncestorList = list(ConditionAncestor = list(
list(ConditionAncestorId = "D000020763", ConditionAncestorTerm = "Pathological Conditions, Anatomical"),
list(ConditionAncestorId = "D000013122", ConditionAncestorTerm = "Spinal Diseases"),
list(ConditionAncestorId = "D000001847", ConditionAncestorTerm = "Bone Diseases"),
list(ConditionAncestorId = "D000009140", ConditionAncestorTerm = "Musculoskeletal Diseases"),
list(ConditionAncestorId = "D000013169", ConditionAncestorTerm = "Spondylolysis"),
list(ConditionAncestorId = "D000055009", ConditionAncestorTerm = "Spondylosis"))),
ConditionBrowseLeafList = list(ConditionBrowseLeaf = list(
list(ConditionBrowseLeafId = "M26992", ConditionBrowseLeafName = "Intervertebral Disc Degeneration",
ConditionBrowseLeafAsFound = "Degenerative Disc Disease",
ConditionBrowseLeafRelevance = "high"), list(
ConditionBrowseLeafId = "M14546", ConditionBrowseLeafName = "Spondylolisthesis",
ConditionBrowseLeafAsFound = "Spondylolisthesis",
ConditionBrowseLeafRelevance = "high"), list(
ConditionBrowseLeafId = "M14510", ConditionBrowseLeafName = "Spinal Stenosis",
ConditionBrowseLeafAsFound = "Spinal Stenosis",
ConditionBrowseLeafRelevance = "high"), list(
ConditionBrowseLeafId = "M5058", ConditionBrowseLeafName = "Constriction, Pathologic",
ConditionBrowseLeafAsFound = "Stenosis", ConditionBrowseLeafRelevance = "high"),
list(ConditionBrowseLeafId = "M21103", ConditionBrowseLeafName = "Pathological Conditions, Anatomical",
ConditionBrowseLeafRelevance = "low"), list(ConditionBrowseLeafId = "M14502",
ConditionBrowseLeafName = "Spinal Diseases",
ConditionBrowseLeafRelevance = "low"), list(ConditionBrowseLeafId = "M3708",
ConditionBrowseLeafName = "Bone Diseases", ConditionBrowseLeafRelevance = "low"),
list(ConditionBrowseLeafId = "M10680", ConditionBrowseLeafName = "Musculoskeletal Diseases",
ConditionBrowseLeafRelevance = "low"), list(ConditionBrowseLeafId = "M14547",
ConditionBrowseLeafName = "Spondylolysis", ConditionBrowseLeafRelevance = "low"),
list(ConditionBrowseLeafId = "M26580", ConditionBrowseLeafName = "Spondylosis",
ConditionBrowseLeafRelevance = "low"), list(ConditionBrowseLeafId = "T6038",
ConditionBrowseLeafName = "Quality of Life",
ConditionBrowseLeafRelevance = "low"))), ConditionBrowseBranchList = list(
ConditionBrowseBranch = list(list(ConditionBrowseBranchAbbrev = "BC05",
ConditionBrowseBranchName = "Muscle, Bone, and Cartilage Diseases"),
list(ConditionBrowseBranchAbbrev = "All", ConditionBrowseBranchName = "All Conditions"),
list(ConditionBrowseBranchAbbrev = "BC23", ConditionBrowseBranchName = "Symptoms and General Pathology"),
list(ConditionBrowseBranchAbbrev = "BXM", ConditionBrowseBranchName = "Behaviors and Mental Disorders"))))))))
What I already achieved
I can easily read a batch of JSON files into a list as described here (paths$path is a vector with the paths to the files):
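library(parallel)
library(jsonlite)
# start one worker per core, leaving one core free
cl <- makeCluster(detectCores() - 1)
# parse every file listed in paths$path into a nested list
json_list <- parLapply(cl, paths$path, function(x) jsonlite::fromJSON(x))
stopCluster(cl)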
What I tried
I tried the option simplifyDataFrame = TRUE in jsonlite::fromJSON; however, I get these warnings:
1: In (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, :
row names were found from a short variable and have been discarded
2: In (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, :
row names were found from a short variable and have been discarded
I tried a solution proposed (how-to-get-data-out-of-nested-xml-structure) for the nested lists generated directly with the web API of clinicaltrials.gov:
as_tibble(test$FullStudy$Study)
Error: Tibble columns must have compatible sizes.
* Size 2: Column `DerivedSection`.
* Size 11: Column `ProtocolSection`.
i Only values of size one are recycled.
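The size mismatch in this error comes from the two top-level elements being lists of different lengths (ProtocolSection holds 11 modules, DerivedSection holds 2), so they cannot become columns of one tibble. As a rough base R sketch of one way around this, every leaf can be flattened into its own column with unlist(), which joins the nested names with ".". The helper names flatten_study and bind_fill are made up for illustration, and the two small lists are heavily reduced stand-ins for parsed studies (the second NCT ID is a placeholder, not a real record).

```r
# Hypothetical, heavily reduced stand-ins for two parsed study lists
study_a <- list(
  ProtocolSection = list(
    IdentificationModule = list(NCTId = "NCT01455805"),
    ConditionsModule = list(ConditionList = list(
      Condition = c("Lumbar Spinal Stenosis", "Spondylolisthesis")))
  ),
  DerivedSection = list(MiscInfoModule = list(VersionHolder = "February 26, 2021"))
)
study_b <- list(
  ProtocolSection = list(IdentificationModule = list(NCTId = "NCT00000000")) # placeholder ID
)

# unlist() recurses through the nesting and joins the names with ".";
# vector leaves such as Condition get a numeric suffix (Condition1, Condition2)
flatten_study <- function(study) {
  as.data.frame(as.list(unlist(study)), stringsAsFactors = FALSE)
}

# Row-bind one-row frames whose columns differ, padding the gaps with NA
bind_fill <- function(frames) {
  all_cols <- unique(unlist(lapply(frames, names)))
  padded <- lapply(frames, function(df) {
    for (col in setdiff(all_cols, names(df))) df[[col]] <- NA_character_
    df[all_cols]
  })
  do.call(rbind, padded)
}

wide <- bind_fill(lapply(list(study_a, study_b), flatten_study))
```

This gives one wide row per study, which is not tidy: repeated elements such as conditions or locations end up spread over numbered columns, so it is probably only useful for a first overview.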
I tried to use tidyjson; however, I could not manage to get a tidy data.frame from my nested lists.
The package tidyjson works perfectly:
It is important to read the JSON file directly with tidyjson::read_json to get the right format (tbl_json, S3: tbl_json/tbl_df/tbl/data.frame) for further processing.
# load the package
library(tidyjson)
# load the JSON file
test <- tidyjson::read_json("NCT0455805.json")
# check the data structure
str(test)
#> tbl_json [1 x 2] (S3: tbl_json/tbl_df/tbl/data.frame)
# make a tibble
test %>% tidyjson::spread_all()
#> # A tibble: 1 x 42
#>   ..JSON  document.id FullStudy.Rank FullStudy.Study~ FullStudy.Study~ FullStudy.Study~ FullStudy.Study~ FullStudy.Study~ FullStudy.Study~ FullStudy.Study~
#>   <chr>         <int>          <dbl> <chr>            <chr>            <chr>            <chr>            <chr>            <chr>            <chr>
#> 1 "{\"F~            1         254369 NCT01455805      Minuteman Spina~ Efficacy and Qu~ October 2020     Active, not rec~ October 13, 2011 October 18, 2011
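spread_all() only spreads scalar values, so repeated elements such as the study locations need their own step. The following is a minimal sketch, assuming the file follows the structure of the test list above (an array of objects under FullStudy > Study > ProtocolSection > ContactsLocationsModule > LocationList > Location). The inline JSON string is a shortened stand-in for the real file, and the object name locations is only illustrative.

```r
library(tidyjson)

# Shortened stand-in for the downloaded file, keeping only two locations
json <- '{"FullStudy": {"Study": {"ProtocolSection": {"ContactsLocationsModule":
  {"LocationList": {"Location": [
    {"LocationCity": "Taunton", "LocationState": "Somerset",
     "LocationCountry": "United Kingdom"},
    {"LocationCity": "Birmingham", "LocationCountry": "United Kingdom"}
  ]}}}}}}'

locations <- as.tbl_json(json) %>%
  # walk down the nested objects to the array of locations
  enter_object("FullStudy", "Study", "ProtocolSection",
               "ContactsLocationsModule", "LocationList", "Location") %>%
  # one row per array element, indexed by array.index
  gather_array() %>%
  # one column per scalar key; keys missing in an element become NA
  spread_all()
```

With the real file, as.tbl_json(json) would be replaced by tidyjson::read_json("NCT0455805.json"). The same pattern should apply to the other repeated elements (arm groups, interventions, outcomes), each giving its own tidy table that can be joined back to the study via document.id.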