I'm trying to read and process a ~5.8GB .xml file from the Wikipedia dumps using R. I don't have much RAM, so I would like to process it in chunks. (Currently, calling xml2::read_xml on the whole file locks up my computer completely.)
The file contains one <page> element for each Wikipedia page, like this:
<page>
  <title>AccessibleComputing</title>
  <ns>0</ns>
  <id>10</id>
  <redirect title="Computer accessibility" />
  <revision>
    <id>631144794</id>
    <parentid>381202555</parentid>
    <timestamp>2014-10-26T04:50:23Z</timestamp>
    <contributor>
      <username>Paine Ellsworth</username>
      <id>9092818</id>
    </contributor>
    <comment>add [[WP:RCAT|rcat]]s</comment>
    <model>wikitext</model>
    <format>text/x-wiki</format>
    <text xml:space="preserve">#REDIRECT [[Computer accessibility]]
{{Redr|move|from CamelCase|up}}</text>
    <sha1>4ro7vvppa5kmm0o1egfjztzcwd0vabw</sha1>
  </revision>
</page>
A sample of the file can be found here
From my perspective, I think it should be possible to read it in chunks, something like one page at a time, and save each processed page element as a line in a .csv file.
I would like to end up with a data.frame with the following columns: id, title and text.
How can I read this .xml file in chunks?
It can be improved, but the main idea is here. You still need to decide how many lines to read in each readLines() call on every iteration, and how to process each chunk (a sketch of one way to wire that up is at the end of this answer), but a solution for getting the chunks is here:
xml <- readLines("ptwiki-20161101-pages-articles.xml", n = 2000)

inicio <- grep(pattern = "<page>", x = xml)   # lines where a page starts
fim    <- grep(pattern = "</page>", x = xml)  # lines where a page ends

if (length(inicio) > length(fim)) {  # if you get more beginnings than ends
  inicio <- inicio[-length(inicio)]  # drop the last (incomplete) one
}

chunks <- vector("list", length(inicio))
for (i in seq_along(chunks)) {
  chunks[[i]] <- xml[inicio[i]:fim[i]]  # the lines belonging to the i-th page
}

chunks <- sapply(chunks, paste, collapse = " ")  # one string per page
I've tried read_xml(chunks[1]) %>% xml_nodes("text") %>% xml_text() and it worked (note that xml_nodes() comes from rvest; with plain xml2 you can use xml_find_all()).
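To go from these chunks to the id / title / text data.frame (and the .csv) the question asks for, and to walk through the whole ~5.8GB file instead of only the first 2000 lines, one option is to keep a file connection open so that each readLines() call picks up where the previous one stopped. The sketch below is just that, a sketch, not tested on the full dump: the output name pages.csv, the 10,000-line chunk size and the leftover-buffer handling are my assumptions, not part of the answer above.

library(xml2)

con <- file("ptwiki-20161101-pages-articles.xml", open = "r")
leftover <- character(0)  # lines of an incomplete <page> kept for the next round

repeat {
  new_lines <- readLines(con, n = 10000)      # chunk size is an arbitrary choice
  if (length(new_lines) == 0 && length(leftover) == 0) break

  xml    <- c(leftover, new_lines)
  inicio <- grep("<page>",  xml, fixed = TRUE)
  fim    <- grep("</page>", xml, fixed = TRUE)

  n <- min(length(inicio), length(fim))       # number of complete pages in this chunk
  if (n > 0) {
    pages <- vapply(seq_len(n),
                    function(i) paste(xml[inicio[i]:fim[i]], collapse = " "),
                    character(1))

    # parse each complete page and pull out the three columns
    df <- do.call(rbind, lapply(pages, function(p) {
      doc <- read_xml(p)
      data.frame(id    = xml_text(xml_find_first(doc, ".//id")),
                 title = xml_text(xml_find_first(doc, ".//title")),
                 text  = xml_text(xml_find_first(doc, ".//text")),
                 stringsAsFactors = FALSE)
    }))

    # append to the .csv, writing the header only on the first write
    first_write <- !file.exists("pages.csv")
    write.table(df, "pages.csv", sep = ",", row.names = FALSE,
                col.names = first_write, append = !first_write)
  }

  # whatever comes after the last complete </page> belongs to the next chunk
  leftover <- if (n > 0) xml[-seq_len(fim[n])] else xml

  if (length(new_lines) == 0) break           # end of file reached
}
close(con)

The leftover buffer is there because a <page> element will often be split across two readLines() calls; carrying the unfinished tail over to the next iteration keeps every page intact. grep(..., fixed = TRUE) simply avoids treating the tags as regular expressions.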