Using python 3.5
I need to find specific text that's stored in old-style, 1997-2003 windows .doc files and dump it into a csv. My constraints are:
a) doc files are in a zipped archive: I can't write to disk/I need to work in memory
b) I need to find specific text with regex, so the docs need to be converted to .txt
Ideally I could read the files with zipfile, pass the data on to some doc-to-txt converter (e.g. textract), and regex on the txt. This might look like
import zipfile
import textract
import re

with zipfile.ZipFile(zip_archive, 'r') as f:
    for name in f.namelist():
        data = f.read(name)
        txt = textract.process(data).decode('utf-8')
        # some regex on txt
This of course doesn't work, because the argument for textract (and any other doc-to-txt converter) is a filepath, while "data" is bytes. Using "name" as the argument gives a MissingFileError, probably because zip archives don't have directory structures, just filenames simulating paths.
Is there any way to regex through zipped doc files only in memory, without extracting the files (and therefore writing them to disk)?
Working with files without writing to a physical drive
In most cases, the files within a zip have to be extracted first to be processed. But this can be done in memory. The roadblock is how to invoke a utility that takes only a mapped filesystem path as an argument to process the text in the zipped files without writing to the physical drive.
Internally, textract invokes a command line utility (antiword) that does the actual text extraction. So an approach that solves this can be applied generally to other command line tools that need to access zip contents via a filesystem path.
Below are several possible solutions to get around this restriction on files:
1. Create a RAM drive and write temp files to it. This requires a sudo prompt, but that can be automated.
2. Mount the zip archive directly on the filesystem with fuse-zip.
3. Use the tempfile module to create self-deleting temporary files. (easiest)
4. For .docx files, read the internal XML directly.
docx2txt is another Python module, but it looks like it will only handle .docx files (as its name implies) and not old Word .doc files.
Why did I do all this leg-work, you may wonder? I actually found this useful for one of my own projects.
1) RAM Drive
If tempfile doesn't satisfy the file constraint goals and you want to ensure that all files used by the tool stay in RAM, creating a RAM drive is a great option. The tool should unmount the drive when it's done, which deletes all the files it stored.
A plus with this option is that Linux supports it natively, so it incurs no additional software dependencies. Windows, on the other hand, will probably require ImDisk.
These are the relevant bash commands on Linux:
$ mkdir ./temp_drive
$ sudo mount -t tmpfs -o size=512m temp_drive ./temp_drive
$
$ mount | tail -n 1 # To see that it was mounted.
$ sudo umount ./temp_drive # To unmount.
On MacOS:
$ diskutil erasevolume HFS+ 'RAM Disk' `hdiutil attach -nomount ram://1048576 `
$ # 512M drive created: 512 * 2048 == 1048576
On Windows:
You may have to use a 3rd-party application like ImDisk (see the links at the end).
To automate the process, this short script prompts the user for their sudo password, then invokes mount to create a RAM drive:
import subprocess as sp
import tempfile
import platform
import getpass

ramdrv = tempfile.TemporaryDirectory()

if platform.system() == 'Linux':
    sudo_pw = getpass.getpass("Enter sudo password: ")
    # Mount RAM drive on Linux. (str.format() rather than an f-string,
    # since f-strings require Python 3.6+ and the question targets 3.5.)
    p = sp.Popen(['sudo', '-S', 'bash', '-c',
                  'mount -t tmpfs -o size=512m tmpfs {}'.format(ramdrv.name)],
                 stderr=sp.STDOUT, stdout=sp.PIPE, stdin=sp.PIPE, bufsize=1,
                 universal_newlines=True)
    print(sudo_pw, file=p.stdin, flush=True)
    del sudo_pw
    print(p.stdout.readline())

elif platform.system() == 'Darwin':
    # And so on...
Whatever GUI package your application uses likely has a password dialog, but getpass works well for console applications.
To access the RAM drive, use the folder it's mounted on like any other file in the system. Write files to it, read files from it, create subfolders, etc.
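As a minimal sketch of that workflow, the snippet below extracts zip members to the mount point and leaves them with real filesystem paths. The archive and its member name are made up, and a plain temporary directory stands in for the mounted RAM drive so the example is self-contained:

```python
import io
import os
import tempfile
import zipfile

# Build a tiny in-memory zip to stand in for the real archive.
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as zf:
    zf.writestr('docs/report.doc', b'fake doc bytes')

# Stand-in for the RAM drive mount point created above.
mount_point = tempfile.TemporaryDirectory()

with zipfile.ZipFile(buf) as zf:
    for name in zf.namelist():
        out_path = os.path.join(mount_point.name, os.path.basename(name))
        with open(out_path, 'wb') as out:
            out.write(zf.read(name))
        # The member now has a real path; a tool like textract could
        # be pointed at out_path here.
```

With the directory actually mounted as tmpfs, those writes never touch the physical disk, and unmounting discards everything at once.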
2) Mount the Zip file
If the Zip file can be mounted on the OS file system, then its files will have paths that can be passed to textract. This could be the best option.
For Linux, a utility that works well is fuse-zip. The few lines below install it and mount a zip file.
$ sudo apt-get install fuse-zip
...
$ mkdir ~/archivedrive
$
$ fuse-zip ~/myarchive.zip ~/archivedrive
$ cd ~/archivedrive/myarchive # I'm inside the zip!
From Python: create a temporary mount point, mount the zip, extract text, then unmount the zip:
>>> import subprocess as sp, tempfile, textract
>>>
>>> zf_path = '/home/me/marine_life.zip'
>>> zipdisk = tempfile.TemporaryDirectory() # Temp mount point.
>>>
>>> cp = sp.run(['fuse-zip', zf_path, zipdisk.name]) # Mount.
>>> cp.returncode
0
>>> all_text = textract.process(f"{zipdisk.name}/marine_life/octopus.doc")
>>>
>>> cp = sp.run(['fusermount', '-u', zipdisk.name]) # Unmount.
>>> cp.returncode
0
>>> del zipdisk # Delete mount point.
>>> all_text[:88]
b'The quick Octopuses live in every ocean, and different species have\nadapted to different'
>>>
>>> # Convert bytes to str if needed.
>>> as_string = all_text.decode('latin-1', errors='replace')
A big plus with this approach is that it doesn't require sudo to mount the archive, so there's no prompting for a password. The only drawback is that it adds a dependency to the project, which is probably not a major concern. Automating the mounting and unmounting is easy with subprocess.run().
I believe that the default configuration on Linux distros allows users to mount FUSE filesystems without sudo, but that would need to be verified for the supported targets.
For Windows, ImDisk can also mount archives and has a command line interface. So that could possibly be automated to support Windows. The XML approach and this approach both are nice because they get the information directly from the zip file without the additional step of writing it out to a file.
Regarding character encodings: I made the assumption in the example that old Eastern European Word documents that predate 2006 might use some encoding other than 'utf-8' (iso-8859-2, latin-1, windows-1250, cyrillic, etc.). You might have to experiment a bit to ensure that each of the files is converted to strings correctly.
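One way to structure that experimentation is to try a list of candidate encodings in order and fall back to a lossy decode. This is only a sketch: the candidate list here is an assumption, and latin-1 must come last because it accepts any byte sequence and therefore never raises:

```python
def decode_best_effort(raw):
    """Try candidate encodings in order; fall back to lossy latin-1."""
    for enc in ('utf-8', 'windows-1250', 'iso-8859-2'):
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Never fails, but may map some characters incorrectly.
    return raw.decode('latin-1', errors='replace')

# windows-1250 bytes are not valid utf-8, so the second attempt wins.
decoded = decode_best_effort('Gdańsk'.encode('windows-1250'))
```

This is a heuristic, not a guarantee: several single-byte encodings will happily decode each other's bytes, so spot-checking the output is still advisable.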
3) tempfile.NamedTemporaryFile
This approach doesn't require any special permissions. It should just work. However, the files it creates aren't guaranteed to be in memory only.
If the concern is that your tool will overpopulate the users' drives with files, this approach would prevent that. The temp files are reliably deleted automatically.
Here's some sample code that creates a NamedTemporaryFile, opens a zip and extracts a file to it, then passes its path to textract:
>>> import zipfile, tempfile, textract
>>>
>>> zf = zipfile.ZipFile('/temp/example.docx')
>>> wf = zf.open('word/document.xml')
>>> tf = tempfile.NamedTemporaryFile()
>>>
>>> for line in wf:
...     tf.file.write(line)
...
>>> tf.file.seek(0)
>>> textract.process(tf.name)
# Lines and lines of text dumped to screen - it worked!
>>> tf.close()
>>>
>>> # The file disappears.
You can reuse the same NamedTemporaryFile object over and over by calling tf.seek(0) to reset its position.
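Here's a self-contained sketch of that reuse pattern. An in-memory zip stands in for the real archive, and a plain read of the temp file's path stands in for the textract call:

```python
import io
import tempfile
import zipfile

# In-memory zip with two members, standing in for the real archive.
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as zf:
    zf.writestr('a.doc', b'first document')
    zf.writestr('b.doc', b'second document')

tf = tempfile.NamedTemporaryFile()
results = []
with zipfile.ZipFile(buf) as zf:
    for name in zf.namelist():
        tf.seek(0)
        tf.truncate()            # drop the previous member's bytes
        tf.write(zf.read(name))
        tf.flush()               # make the bytes visible to other readers
        # textract.process(tf.name) would go here; a plain read is
        # shown instead so the example runs anywhere.
        with open(tf.name, 'rb') as reader:
            results.append(reader.read())
tf.close()                       # the temp file is deleted here
```

The truncate() matters when a later member is shorter than an earlier one; without it, stale bytes from the previous member would remain at the end of the file.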
Don't close the file until you're done with it; it vanishes when you close it. NamedTemporaryFile instances are automatically deleted when closed, when their refcount drops to 0, or when your program exits.
If you want a temporary folder that's guaranteed to disappear after your program is done, use tempfile.TemporaryDirectory.
In the same module, tempfile.SpooledTemporaryFile is a file that exists in memory. However, it's difficult to get a path to one (only its file descriptor is known), and even if you did find a way to retrieve a path, it wouldn't be usable by textract.
textract runs in a separate process, but that process inherits the file handles of the parent. That's what makes it possible to share these temp files between the two.
4) Word.docx text extraction via XML
This approach attempts to remove the need for the 3rd party utility by doing the work within Python, or using another tool that doesn't require FS paths.
The .docx files within the zip files are also zip files containing XML. XML is text and it can be parsed raw with regular expressions, or passed to an XML reader first.
The Python module docx2txt does pretty much the same thing as the second example below. I looked at its sources: it opens the Word document as a zip and uses an XML parser to get the text nodes. So it shares the same limitation as this approach: it only handles .docx, not the old .doc format.
The two examples below read the file directly out of the .docx archive - the file isn't extracted to disk.
If you want to convert the raw XML text to dictionaries and lists, you can use xmltodict:
import zipfile
import xmltodict

zf = zipfile.ZipFile('/temp/example.docx')
data = xmltodict.parse(zf.open('word/document.xml'))

# Navigate the nested dicts/lists down to one of the text nodes.
some_text = data['w:document']['w:body']['w:p'][46]['w:r']['w:t']
print(some_text)
I found this format a bit unwieldy because of the complicated nesting structure of the XML elements, and it doesn't give you the advantages an XML reader does as far as locating nodes.
Using xml.etree.ElementTree, an XPATH expression can extract all the text nodes in one shot.
import re
import xml.etree.ElementTree as ET
import zipfile

_NS_DICT = {'w': 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}

def get_docx_text(docx_path):
    """
    Opens the .docx file at 'docx_path', parses its internal document.xml
    document, then returns its text as one (possibly large) string.
    """
    with zipfile.ZipFile(docx_path) as zf:
        tree = ET.parse(zf.open('word/document.xml'))
    all_text = '\n'.join(n.text for n in tree.findall('.//w:t', _NS_DICT))
    return all_text
Using the xml.etree.ElementTree module as above makes text extraction possible in only a few lines of code.
In get_docx_text(), this line grabs all the text:
all_text = '\n'.join(n.text for n in tree.findall('.//w:t', _NS_DICT))
The string './/w:t' is an XPATH expression that tells the module to select all the t (text) nodes of the Word document. The join over the generator expression then concatenates all the text.
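To see that namespaced XPATH lookup in isolation, here's a toy document.xml-style snippet. The XML content is made up for illustration, but the namespace URI is the real WordprocessingML one used above:

```python
import xml.etree.ElementTree as ET

xml_src = ('<w:document xmlns:w='
           '"http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
           '<w:body><w:p><w:r><w:t>Hello</w:t></w:r>'
           '<w:r><w:t>world</w:t></w:r></w:p></w:body></w:document>')

ns = {'w': 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}
root = ET.fromstring(xml_src)

# './/w:t' selects every w:t descendant, in document order.
texts = [n.text for n in root.findall('.//w:t', ns)]
```

The prefix 'w' in the expression is resolved through the dict passed to findall(), so it doesn't have to match the prefix used in the document itself.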
Once you have the text returned from get_docx_text(), you can apply your regular expressions, iterate over it line by line, or do whatever else you need. An re expression could, for example, grab all parenthetical phrases.
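As a sketch of that last idea (the sample text here is made up):

```python
import re

text = ("Octopuses (order Octopoda) are soft-bodied mollusks "
        "(like squid and cuttlefish).")

# Non-greedy '.*?' so each (...) pair is matched separately rather
# than one match spanning from the first '(' to the last ')'.
parens = re.findall(r'\(.*?\)', text)
```
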
Links
The Fuse filesystem: https://github.com/libfuse/libfuse
fuse-zip man page: https://linux.die.net/man/1/fuse-zip
MacOS Fuse: https://osxfuse.github.io/
ImDisk (Windows): http://www.ltr-data.se/opencode.html/#ImDisk
List of RAM drive software: https://en.wikipedia.org/wiki/List_of_RAM_drive_software
MS docx file format: https://wiki.fileformat.com/word-processing/docx/
The xml.ElementTree doc: https://docs.python.org/3/library/xml.etree.elementtree.html?highlight=xml%20etree#module-xml.etree.ElementTree
XPATH: https://docs.python.org/3/library/xml.etree.elementtree.html?highlight=xml%20etree#elementtree-xpath
The XML example borrowed some ideas from: https://etienned.github.io/posts/extract-text-from-word-docx-simply/