I'm trying to create a data frame from a TEI-XML version of Moby Dick using Hadley Wickham's xml2
package. I want the data frame to ultimately look like this (for all the words in the novel):
df <- data.frame(
chapter = c("1", "1", "1"),
words = c("call", "me", "ishmael"))
I'm able to get pieces, but not the whole thing. Here's what I've got so far:
library("xml2")
library("magrittr") # for the %>% pipe used below
# Read file
melville <- read_xml("data/melville.xml")
# Get chapter divs (remember, doesn't include epilogue)
chap_frames <- xml_find_all(melville, "//d1:div1[@type='chapter']", xml_ns(melville))
This gives us a list of length 134 (one element per chapter). We can get the chapter number for a specific element as follows:
xml_attr(chap_frames[[1]], "n")
We can get the paragraphs of a specific chapter (that is, minus the chapter heading) as follows:
words <- xml_find_all(chap_frames[[1]], ".//d1:p", xml_ns(melville)) %>% # remember doesn't include epilogue
xml_text()
And we can get the words of the chapters as follows:
# Split words function
split_words <- function (ll) {
result <- unlist(strsplit(ll, "\\W+"))
result <- result[result != ""]
tolower(result)
}
# Apply function
words <- split_words(words)
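To see what `split_words()` is doing under the hood: `strsplit()` with the `\\W+` pattern splits on runs of non-word characters, and a string that starts with punctuation yields an empty first element, which is why the function filters out empty strings before lowercasing. For example:

```r
# Runs of non-word characters (hyphen, semicolon, comma, space) all count
# as a single delimiter:
unlist(strsplit("Moby-Dick; or, The Whale", "\\W+"))
# returns c("Moby", "Dick", "or", "The", "Whale")

# A leading non-word character produces an empty first element, hence
# the `result != ""` filter in split_words():
unlist(strsplit("...Call me Ishmael.", "\\W+"))
# returns c("", "Call", "me", "Ishmael")
```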
What I can't figure out is how to get the chapter number for each of the words. I had a toy example that worked:
mini <- read_xml(
'
<div1 type="chapter" n="1" id="_75784">
<head>Loomings</head>
<p rend="fiction">Call me Ishmael.</p>
<p rend="fiction">There now is your insular city of the Manhattoes, belted round by wharves as Indian isles by coral reefs- commerce surrounds it with her surf.</p>
</div1>
')
# Function
process_chap <- function(div){
chapter <- xml_attr(div, "n")
words <- xml_find_all(div, "//p") %>%
xml_text()
data.frame(chapter = chapter,
word = split_words(words))
}
process_chap(mini)
But it doesn't work for the longer example:
process_chap2 <- function(div){
chapter <- xml_attr(div, "n")
words <- xml_find_all(div, ".//d1:p", xml_ns(melville)) %>% # remember doesn't include epilogue
xml_text()
data.frame(chapter = chapter,
word = split_words(words))
}
# Fails because there are more words than chapter names
df <- process_chap2(chap_frames)
# Gives all the words p (not chapters), chapter numbers are `NULL`.
df2 <- process_chap2(melville)
(I know why the toy example works but the Melville one doesn't; I included it to show what I'm trying to do.) I'm guessing I might need a loop of some sort, but I'm not sure where to begin. Any suggestions?
PS: I'm not entirely sure if I should link to the XML version of Moby Dick I found on GitHub, but you can find it easily enough by searching for melville1.xml.
The approach is to grab the data for each chapter one at a time, then combine the words of each chapter together with its chapter number into a data frame. R will recycle the single chapter-number value as often as needed:
words <- letters[1:3]
n <- 1
df <- data.frame(words, n)
df
## words n
## 1 a 1
## 2 b 1
## 3 c 1
Having gathered the information for all your chapters into such data frames, you can then use rbind()
to combine them into one data frame.
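For instance, with two such per-chapter data frames (toy values, just to show the shape; the real ones come from the XML):

```r
df1 <- data.frame(words = c("call", "me", "ishmael"), chapter = "1",
                  stringsAsFactors = FALSE)
df2 <- data.frame(words = c("the", "carpet", "bag"), chapter = "2",
                  stringsAsFactors = FALSE)

# One data frame, 6 rows, with each chapter value carried along
rbind(df1, df2)
```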
Here's how this might look for the first two chapters of your data:
library(xml2)
library(dplyr)
library(stringr)
# Read file
url <- "https://raw.githubusercontent.com/reganna/TextAnalysisWithR/master/data/XML1/melville1.xml"
melville <- read_xml(url)
# get chapter frame and number
chap_frames <- xml_find_all(melville, "//d1:div1[@type='chapter']", xml_ns(melville))
chap_n <- xml_attr(chap_frames, "n")
# get the data for the first chapter
words1 <-
xml_find_all(chap_frames[[1]], ".//d1:p", xml_ns(melville)) %>%
xml_text() %>%
unlist() %>%
str_split("\\W+") %>%
unlist() %>%
tolower()
n1 <- xml_attr(chap_frames[[1]], "n")
# get the data for the second chapter
words2 <-
xml_find_all(chap_frames[[2]], ".//d1:p", xml_ns(melville)) %>%
xml_text() %>%
unlist() %>%
str_split("\\W+") %>%
unlist() %>%
tolower()
n2 <- xml_attr(chap_frames[[2]], "n")
# put it together
df <-
rbind(
data_frame(words=words1, chapter=n1),
data_frame(words=words2, chapter=n2)
)
df
## Source: local data frame [3,719 x 2]
##
## words chapter
## 1 call 1
## 2 me 1
## 3 ishmael 1
## 4 some 1
## 5 years 1
## 6 ago 1
## 7 never 1
## 8 mind 1
## 9 how 1
## 10 long 1
## .. ... ...
To do this more efficiently for all the chapters, you might write a loop that repeats those steps for each chapter, or define a function that does the extraction, apply it to every chapter, and then combine the results via rbind()
afterwards.
I would probably do it like this:
# building function
extract_data <- function(chapter_frame){
words <-
xml_find_all(chapter_frame, ".//d1:p", xml_ns(melville)) %>%
xml_text() %>%
unlist() %>%
str_split("\\W+") %>%
unlist() %>%
tolower()
n <- xml_attr(chapter_frame, "n")
pos <- seq_along(words)  # running word position within the chapter
data_frame(words, chapter = n, position = pos)
}
# using function
chapter_words <-
lapply(chap_frames, extract_data)
# `rbind()`ing data
chapter_words <- do.call(rbind, chapter_words)
chapter_words
## Source: local data frame [216,669 x 3]
##
## words chapter position
## 1 call 1 1
## 2 me 1 2
## 3 ishmael 1 3
## 4 some 1 4
## 5 years 1 5
## 6 ago 1 6
## 7 never 1 7
## 8 mind 1 8
## 9 how 1 9
## 10 long 1 10
## .. ... ... ...
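One small variation: since dplyr is loaded anyway, `bind_rows()` collapses a list of data frames in a single step, so the `lapply()` result can be combined with `bind_rows(chapter_words)` instead of `do.call(rbind, chapter_words)`. A minimal sketch with toy data:

```r
library(dplyr)

chapters <- list(
  data.frame(words = c("call", "me", "ishmael"), chapter = "1",
             stringsAsFactors = FALSE),
  data.frame(words = c("a", "b"), chapter = "2",
             stringsAsFactors = FALSE)
)

# Equivalent to do.call(rbind, chapters) for this case
bind_rows(chapters)  # 5 rows, two columns
```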