Read a very large XML file in R from transport simulation output

I have an XML file of about 20 GB that is the output of a transport simulation package. The XML data looks something like this:

<?xml version="1.0" encoding="utf-8"?>
<events version="1.0">
    <event time="10998.0" type="actend" person="pv_car_5315_9162_1" link="link36" actType="home"  />
    <event time="10998.0" type="departure" person="pv_car_5315_9162_1" link="link36" legMode="car"  />
    <event time="10998.0" type="PersonEntersVehicle" person="pv_car_5315_9162_1" vehicle="pv_car_5315_9162_1"  />
    <event time="10998.0" type="coldEmissionEvent" linkId="link36" vehicleId="pv_car_5315_9162_1" CH4="0.08123456"  />
    <event time="10998.0" type="vehicle enters traffic" person="pv_car_5315_9162_1" link="link36" vehicle="pv_car_5315_9162_1" networkMode="car" relativePosition="1.0"  />
    <event time="10999.0" type="left link" link="link36" vehicle="pv_car_5315_9162_1"  />
    <event time="10999.0" type="entered link" link="link65" vehicle="pv_car_5315_9162_1"  />
    <event time="11021.0" type="warmEmissionEvent" linkId="link65" vehicleId="pv_car_5315_9162_1" CO="0.07734376800000001" CO2_TOTAL="62.389434800000004" FC="20.239834596" HC="0.004187986" NMHC="0.0021681760000000004" NOx="0.1196987988" NO2="0.0373424708" PM="0.0011292424" SO2="3.03102E-4" FC_MJ="0.8570199968000001" CO2_rep="58.54285888000001" CO2e="59.168261720000004" PM2_5="0.0011292424" PM2_5_non_exhaust="0.005600000000000001" PM_non_exhaust="0.010399999600000001" BC_exhaust="6.607388E-4" BC_non_exhaust="5.600000000000001E-4" Benzene="1.27142E-4" PN="1.135524E12" Pb="0.0" CH4="0.0020198092" N2O="0.0019291828" NH3="0.0057024384"  />
    <event time="11021.0" type="left link" link="link65" vehicle="pv_car_5315_9162_1"  />
    <event time="11021.0" type="entered link" link="link52" vehicle="pv_car_5315_9162_1"  />
    <event time="11036.0" type="warmEmissionEvent" linkId="link52" vehicleId="pv_car_5315_9162_1" CO="0.038671884000000004" CO2_TOTAL="31.194717400000002" FC="10.119917298" HC="0.002093993" NMHC="0.0010840880000000002" NOx="0.0598493994" NO2="0.0186712354" PM="5.646212E-4" SO2="1.51551E-4" FC_MJ="0.42850999840000004" CO2_rep="29.271429440000006" CO2e="29.584130860000002" PM2_5="5.646212E-4" PM2_5_non_exhaust="0.0028000000000000004" PM_non_exhaust="0.0051999998000000006" BC_exhaust="3.303694E-4" BC_non_exhaust="2.8000000000000003E-4" Benzene="6.3571E-5" PN="5.67762E11" Pb="0.0" CH4="0.0010099046" N2O="9.645914E-4" NH3="0.0028512192"  />
    <event time="11036.0" type="left link" link="link52" vehicle="pv_car_5315_9162_1"  />
    <event time="11036.0" type="entered link" link="link25" vehicle="pv_car_5315_9162_1"  />
    <event time="11046.0" type="warmEmissionEvent" linkId="link25" vehicleId="pv_car_5315_9162_1" CO="0.038671884000000004" CO2_TOTAL="31.194717400000002" FC="10.119917298" HC="0.002093993" NMHC="0.0010840880000000002" NOx="0.0598493994" NO2="0.0186712354" PM="5.646212E-4" SO2="1.51551E-4" FC_MJ="0.42850999840000004" CO2_rep="29.271429440000006" CO2e="29.584130860000002" PM2_5="5.646212E-4" PM2_5_non_exhaust="0.0028000000000000004" PM_non_exhaust="0.0051999998000000006" BC_exhaust="3.303694E-4" BC_non_exhaust="2.8000000000000003E-4" Benzene="6.3571E-5" PN="5.67762E11" Pb="0.0" CH4="0.0010099046" N2O="9.645914E-4" NH3="0.0028512192"  />
</events>

What I want is very simple: the only events that are useful to me are those whose type is warmEmissionEvent, and I want a data frame with one row per event showing the time, linkId, CO and CO2_TOTAL values.

The problem is that this will be a very large XML file (20GB), so my laptop will die with the code below:

library(xml2)

# Define the path to your MATSim emission event XML file
xml_file <- "output5.emission.events.offline.xml"

# Read the XML file
doc <- read_xml(xml_file)

# Extract the 'event' nodes with type 'warmEmissionEvent'
emission_events <- xml_find_all(doc, ".//event[@type='warmEmissionEvent']")

# Create empty lists to store the extracted data
event_time <- vector()
link_id <- vector()
vehicle_id <- vector()
CO <- vector()
CO2_total <- vector()

# Iterate over each 'event' node
for (event in emission_events) {
  # Extract the values of the desired attributes
  event_time <- c(event_time, as.numeric(xml_attr(event, "time")))
  link_id <- c(link_id, xml_attr(event, "linkId"))
  vehicle_id <- c(vehicle_id, xml_attr(event, "vehicleId"))
  CO <- c(CO, as.numeric(xml_attr(event, "CO")))
  CO2_total <- c(CO2_total, as.numeric(xml_attr(event, "CO2_TOTAL")))
}

# Create a data frame with the extracted data
emission_data <- data.frame(
  event_time = event_time,
  link_id = link_id,
  vehicle_id = vehicle_id,
  CO = CO,
  CO2_total = CO2_total
)

I tried some methods online, but they are not working really well :(

Do you maybe have a better solution for this? Or maybe some other tools that I can use to analyse the data?

Solution

Since it sounds like you don't have enough memory to read the entire file at once, other options are needed.
You only need to extract information from individual lines, which simplifies the requirements.
One option is to read the file in chunks, filter each chunk for "warmEmissionEvent", then process and save the data before moving on to the next chunk.

Maybe this will work:

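library(stringr)

xml_file  <- "output5.emission.events.offline.xml"
read_only <- 4   # lines per chunk

output <- lapply(seq(0, 16, read_only), function(skip_n) {
   # read each line as one string: no header row, no quote handling, split on newlines only
   myfile <- read.csv(file = xml_file, skip = skip_n, nrows = read_only,
                      header = FALSE, sep = "\n", quote = "", stringsAsFactors = FALSE)
   warmemission <- myfile[grep("warmEmissionEvent", myfile[, 1]), 1]

   # attribute values are quoted, so match everything up to the closing quote
   time     <- as.numeric(str_extract(warmemission, '(?<= time=")[^"]*'))
   linkID   <- str_extract(warmemission, '(?<= linkId=")[^"]*')
   CO       <- as.numeric(str_extract(warmemission, '(?<= CO=")[^"]*'))
   CO2Total <- as.numeric(str_extract(warmemission, '(?<= CO2_TOTAL=")[^"]*'))

   data.frame(time, linkID, CO, CO2Total)
   # maybe write df to different files to save intermediate steps
})
dplyr::bind_rows(output)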

Yes, using str_extract is a bit of brute force, but with only 4 pieces of data per line it was not worth the effort to get xml2 working.
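
Whatever patterns you end up with, it is worth trying them on a single line before running them over 20 GB. Here is a minimal sketch in base R. Note that get_attr is a hypothetical helper name (it is not part of stringr or xml2), the sample line is shortened from the question, and the helper assumes attribute names contain no regex metacharacters:

```r
# Hypothetical helper: pull the value of one attribute out of each line.
# Returns NA where the attribute is not present on a line.
get_attr <- function(lines, name) {
  # lookbehind for ' name="' then take everything up to the closing quote
  pattern <- paste0('(?<= ', name, '=")[^"]*')
  m <- regexpr(pattern, lines, perl = TRUE)
  out <- rep(NA_character_, length(lines))
  out[m != -1] <- regmatches(lines, m)
  out
}

# Shortened copy of one warmEmissionEvent line from the question
sample_line <- '    <event time="11021.0" type="warmEmissionEvent" linkId="link65" vehicleId="pv_car_5315_9162_1" CO="0.07734376800000001" CO2_TOTAL="62.389434800000004" FC="20.239834596"  />'

get_attr(sample_line, "time")      # "11021.0"
get_attr(sample_line, "linkId")    # "link65"
as.numeric(get_attr(sample_line, "CO"))
as.numeric(get_attr(sample_line, "CO2_TOTAL"))
```

The leading space and the opening quote in the pattern are there so that "CO" cannot match inside a longer attribute name.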

Of course, you will want to read much larger chunks than 4 lines at a time.
The sequence will need to end at the number of lines in your file, not 16. If you try reading past the end of the file you will likely generate an error and will need to start again, hence the recommendation to write the output from each chunk to a file.
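
If you would rather not work out the line count first, or re-skip from the top of the file for every chunk, one alternative is to keep a connection open and call readLines() repeatedly: each call continues where the previous one stopped and returns a zero-length vector at the end of the file. A sketch in base R, assuming the attribute layout shown in the question. The function name extract_warm_events, its arguments and the output file name are all made up for illustration:

```r
# Hypothetical function: stream the events file chunk by chunk and
# append the warmEmissionEvent rows to a CSV file.
extract_warm_events <- function(xml_file, out_csv, chunk_size = 1e6) {
  # pull one attribute value per line (NA if the attribute is absent)
  get_attr <- function(lines, name) {
    m <- regexpr(paste0('(?<= ', name, '=")[^"]*'), lines, perl = TRUE)
    out <- rep(NA_character_, length(lines))
    out[m != -1] <- regmatches(lines, m)
    out
  }

  con <- file(xml_file, open = "r")
  on.exit(close(con))
  first_chunk <- TRUE

  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0) break          # end of file reached

    warm <- lines[grepl('type="warmEmissionEvent"', lines, fixed = TRUE)]
    if (length(warm) == 0) next            # nothing to keep in this chunk

    chunk <- data.frame(
      time     = as.numeric(get_attr(warm, "time")),
      linkID   = get_attr(warm, "linkId"),
      CO       = as.numeric(get_attr(warm, "CO")),
      CO2Total = as.numeric(get_attr(warm, "CO2_TOTAL"))
    )

    # overwrite and write the header on the first write, append afterwards
    write.table(chunk, out_csv, sep = ",", row.names = FALSE,
                col.names = first_chunk, append = !first_chunk)
    first_chunk <- FALSE
  }
  invisible(out_csv)
}

# Possible usage (file names are placeholders):
# extract_warm_events("output5.emission.events.offline.xml", "warm_emissions.csv")
# emission_data <- read.csv("warm_emissions.csv")
```

Because each chunk is written out before the next one is read, only one chunk is held in memory at a time, and a crash part-way through leaves the rows processed so far on disk. If the file contains no warmEmissionEvent lines at all, no output file is created.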