I have very large DOCX files that I was hoping to parse so I can build a database of sorts showing the frequency of a word/string across the documents. From what I gather this is definitely not an easy task. I was just hoping for some direction as to a library I could use to help me with this.
This is an example of what one may look like. The structure isn't consistent, which will complicate things as well. Any direction will be appreciated!
If (as per your comment) you're able to do this in Python, look at the following snippets:
The first thing to realise is that .docx files are actually .zip archives containing a number of XML files. Most text content is stored in word/document.xml. Word does some complicated things with numbered lists, which will require you to also load other XML files such as styles.xml.
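To see this for yourself, Python's standard-library zipfile can open a .docx directly. The sketch below builds a tiny in-memory stand-in (the member contents are placeholders, not real WordprocessingML) so it runs on its own; with a real file you would pass its path to zipfile.ZipFile instead:

```python
import io
import zipfile

# Build a minimal in-memory stand-in for a .docx; with a real document
# you would do zipfile.ZipFile("your_file.docx") directly.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("word/document.xml", "<w:document/>")  # placeholder content
    zf.writestr("word/styles.xml", "<w:styles/>")      # placeholder content

with zipfile.ZipFile(buf) as docx:
    members = docx.namelist()                   # every file in the archive
    xml_bytes = docx.read("word/document.xml")  # raw XML of the main body

print(members)
```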
The markup of DOCX files can be a pain, as the document is structured in w:p elements (paragraphs) and arbitrary w:r elements (runs). These runs are basically 'a bit of typing', so a run can be a single letter or several words together.
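To illustrate, here is a hand-written fragment (not taken from a real file) in which one short sentence is split across three runs, and how joining the w:t text of every run reassembles the paragraph:

```python
from lxml import etree

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# A hand-written w:p with its text arbitrarily split across three runs:
xml = f'''<w:p xmlns:w="{W}">
  <w:r><w:t>He</w:t></w:r>
  <w:r><w:t>llo </w:t></w:r>
  <w:r><w:t>world</w:t></w:r>
</w:p>'''

p = etree.fromstring(xml)
# Concatenating every run's w:t text recovers the paragraph text:
text = "".join(p.xpath(".//w:t/text()", namespaces={"w": W}))
print(text)  # Hello world
```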
We use UpdateableZipFile from https://stackoverflow.com/a/35435548, primarily because we also wanted to be able to edit the documents; if you don't need that, you could just use the relevant snippets from it.
import os

from UpdateableZipFile import UpdateableZipFile
from lxml import etree

source_file = UpdateableZipFile(os.path.join(path, input_file))
nsmap = {'w': "http://schemas.openxmlformats.org/wordprocessingml/2006/main",
         'mc': "http://schemas.openxmlformats.org/markup-compatibility/2006",
         }  # you might need a few more namespace definitions if you get funky docx inputs
# Returns the root of an etree object based on the document.xml XML tree:
document = source_file.read_member('word/document.xml')
# Query the XML using XPath (don't use regex); this gives the text of all paragraph nodes:
paragraph_list = document.xpath("//w:p/descendant-or-self::*/text()", namespaces=nsmap)
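If you only need read access, you can skip UpdateableZipFile entirely and combine the standard-library zipfile with lxml. A self-contained sketch using the same XPath (it builds a tiny in-memory document rather than reading a real file, so the body XML here is a simplified stand-in):

```python
import io
import zipfile

from lxml import etree

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Simplified stand-in for a real .docx, built in memory so this runs
# on its own; with a real file, open it with zipfile.ZipFile(path).
doc_xml = (f'<w:document xmlns:w="{W}"><w:body>'
           '<w:p><w:r><w:t>first paragraph</w:t></w:r></w:p>'
           '<w:p><w:r><w:t>second paragraph</w:t></w:r></w:p>'
           '</w:body></w:document>')
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("word/document.xml", doc_xml)

with zipfile.ZipFile(buf) as docx:
    root = etree.fromstring(docx.read("word/document.xml"))

# Same XPath as above: the text of all paragraph nodes, in document order.
paragraph_list = root.xpath("//w:p/descendant-or-self::*/text()",
                            namespaces={"w": W})
print(paragraph_list)  # ['first paragraph', 'second paragraph']
```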
You can then feed the text to an NLP library such as spaCy:
import spacy

nlp = spacy.load("en_core_web_sm")
word_counts = {}
for paragraph in paragraph_list:
    doc = nlp(paragraph)
    for token in doc:
        if token.text in word_counts:
            word_counts[token.text] += 1
        else:
            word_counts[token.text] = 1
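The same counting can be written more compactly with collections.Counter, which handles the "seen before?" bookkeeping for you. In this sketch a plain .split() stands in for spaCy tokenization (and the paragraphs are stand-in data) so it has no model dependency:

```python
from collections import Counter

# Stand-in data; in the real pipeline this comes from the XPath query above.
paragraph_list = ["the cat sat", "the cat ran"]

word_counts = Counter()
for paragraph in paragraph_list:
    # .split() is a crude stand-in for spaCy's tokenizer here.
    word_counts.update(paragraph.split())

print(word_counts["the"])  # 2
```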
spaCy will tokenize the text for you, and can do much more, such as Named Entity Recognition and Part-of-Speech tagging.